| 
									
										
										
										
											2005-04-16 15:20:36 -07:00
										 |  |  | /*
 | 
					
						
							|  |  |  |  *  linux/mm/vmscan.c | 
					
						
							|  |  |  |  * | 
					
						
							|  |  |  |  *  Copyright (C) 1991, 1992, 1993, 1994  Linus Torvalds | 
					
						
							|  |  |  |  * | 
					
						
							|  |  |  |  *  Swap reorganised 29.12.95, Stephen Tweedie. | 
					
						
							|  |  |  |  *  kswapd added: 7.1.96  sct | 
					
						
							|  |  |  |  *  Removed kswapd_ctl limits, and swap out as many pages as needed | 
					
						
							|  |  |  |  *  to bring the system back to freepages.high: 2.4.97, Rik van Riel. | 
					
						
							|  |  |  |  *  Zone aware kswapd started 02/00, Kanoj Sarcar (kanoj@sgi.com). | 
					
						
							|  |  |  |  *  Multiqueue VM started 5.8.00, Rik van Riel. | 
					
						
							|  |  |  |  */ | 
					
						
							|  |  |  | 
 | 
					
						
							|  |  |  | #include <linux/mm.h>
 | 
					
						
							|  |  |  | #include <linux/module.h>
 | 
					
						
							| 
									
										
											  
											
												include cleanup: Update gfp.h and slab.h includes to prepare for breaking implicit slab.h inclusion from percpu.h
percpu.h is included by sched.h and module.h and thus ends up being
included when building most .c files.  percpu.h includes slab.h which
in turn includes gfp.h making everything defined by the two files
universally available and complicating inclusion dependencies.
percpu.h -> slab.h dependency is about to be removed.  Prepare for
this change by updating users of gfp and slab facilities include those
headers directly instead of assuming availability.  As this conversion
needs to touch large number of source files, the following script is
used as the basis of conversion.
  http://userweb.kernel.org/~tj/misc/slabh-sweep.py
The script does the followings.
* Scan files for gfp and slab usages and update includes such that
  only the necessary includes are there.  ie. if only gfp is used,
  gfp.h, if slab is used, slab.h.
* When the script inserts a new include, it looks at the include
  blocks and try to put the new include such that its order conforms
  to its surrounding.  It's put in the include block which contains
  core kernel includes, in the same order that the rest are ordered -
  alphabetical, Christmas tree, rev-Xmas-tree or at the end if there
  doesn't seem to be any matching order.
* If the script can't find a place to put a new include (mostly
  because the file doesn't have fitting include block), it prints out
  an error message indicating which .h file needs to be added to the
  file.
The conversion was done in the following steps.
1. The initial automatic conversion of all .c files updated slightly
   over 4000 files, deleting around 700 includes and adding ~480 gfp.h
   and ~3000 slab.h inclusions.  The script emitted errors for ~400
   files.
2. Each error was manually checked.  Some didn't need the inclusion,
   some needed manual addition while adding it to implementation .h or
   embedding .c file was more appropriate for others.  This step added
   inclusions to around 150 files.
3. The script was run again and the output was compared to the edits
   from #2 to make sure no file was left behind.
4. Several build tests were done and a couple of problems were fixed.
   e.g. lib/decompress_*.c used malloc/free() wrappers around slab
   APIs requiring slab.h to be added manually.
5. The script was run on all .h files but without automatically
   editing them as sprinkling gfp.h and slab.h inclusions around .h
   files could easily lead to inclusion dependency hell.  Most gfp.h
   inclusion directives were ignored as stuff from gfp.h was usually
   wildly available and often used in preprocessor macros.  Each
   slab.h inclusion directive was examined and added manually as
   necessary.
6. percpu.h was updated not to include slab.h.
7. Build test were done on the following configurations and failures
   were fixed.  CONFIG_GCOV_KERNEL was turned off for all tests (as my
   distributed build env didn't work with gcov compiles) and a few
   more options had to be turned off depending on archs to make things
   build (like ipr on powerpc/64 which failed due to missing writeq).
   * x86 and x86_64 UP and SMP allmodconfig and a custom test config.
   * powerpc and powerpc64 SMP allmodconfig
   * sparc and sparc64 SMP allmodconfig
   * ia64 SMP allmodconfig
   * s390 SMP allmodconfig
   * alpha SMP allmodconfig
   * um on x86_64 SMP allmodconfig
8. percpu.h modifications were reverted so that it could be applied as
   a separate patch and serve as bisection point.
Given the fact that I had only a couple of failures from tests on step
6, I'm fairly confident about the coverage of this conversion patch.
If there is a breakage, it's likely to be something in one of the arch
headers which should be easily discoverable easily on most builds of
the specific arch.
Signed-off-by: Tejun Heo <tj@kernel.org>
Guess-its-ok-by: Christoph Lameter <cl@linux-foundation.org>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Lee Schermerhorn <Lee.Schermerhorn@hp.com>
											
										 
											2010-03-24 17:04:11 +09:00
										 |  |  | #include <linux/gfp.h>
 | 
					
						
							| 
									
										
										
										
											2005-04-16 15:20:36 -07:00
										 |  |  | #include <linux/kernel_stat.h>
 | 
					
						
							|  |  |  | #include <linux/swap.h>
 | 
					
						
							|  |  |  | #include <linux/pagemap.h>
 | 
					
						
							|  |  |  | #include <linux/init.h>
 | 
					
						
							|  |  |  | #include <linux/highmem.h>
 | 
					
						
							| 
									
										
											  
											
												memcg: add memory.pressure_level events
With this patch userland applications that want to maintain the
interactivity/memory allocation cost can use the pressure level
notifications.  The levels are defined like this:
The "low" level means that the system is reclaiming memory for new
allocations.  Monitoring this reclaiming activity might be useful for
maintaining cache level.  Upon notification, the program (typically
"Activity Manager") might analyze vmstat and act in advance (i.e.
prematurely shutdown unimportant services).
The "medium" level means that the system is experiencing medium memory
pressure, the system might be making swap, paging out active file
caches, etc.  Upon this event applications may decide to further analyze
vmstat/zoneinfo/memcg or internal memory usage statistics and free any
resources that can be easily reconstructed or re-read from a disk.
The "critical" level means that the system is actively thrashing, it is
about to out of memory (OOM) or even the in-kernel OOM killer is on its
way to trigger.  Applications should do whatever they can to help the
system.  It might be too late to consult with vmstat or any other
statistics, so it's advisable to take an immediate action.
The events are propagated upward until the event is handled, i.e.  the
events are not pass-through.  Here is what this means: for example you
have three cgroups: A->B->C.  Now you set up an event listener on
cgroups A, B and C, and suppose group C experiences some pressure.  In
this situation, only group C will receive the notification, i.e.  groups
A and B will not receive it.  This is done to avoid excessive
"broadcasting" of messages, which disturbs the system and which is
especially bad if we are low on memory or thrashing.  So, organize the
cgroups wisely, or propagate the events manually (or, ask us to
implement the pass-through events, explaining why would you need them.)
Performance wise, the memory pressure notifications feature itself is
lightweight and does not require much of bookkeeping, in contrast to the
rest of memcg features.  Unfortunately, as of current memcg
implementation, pages accounting is an inseparable part and cannot be
turned off.  The good news is that there are some efforts[1] to improve
the situation; plus, implementing the same, fully API-compatible[2]
interface for CONFIG_MEMCG=n case (e.g.  embedded) is also a viable
option, so it will not require any changes on the userland side.
[1] http://permalink.gmane.org/gmane.linux.kernel.cgroups/6291
[2] http://lkml.org/lkml/2013/2/21/454
[akpm@linux-foundation.org: coding-style fixes]
[akpm@linux-foundation.org: fix CONFIG_CGROPUPS=n warnings]
Signed-off-by: Anton Vorontsov <anton.vorontsov@linaro.org>
Acked-by: Kirill A. Shutemov <kirill@shutemov.name>
Acked-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Tejun Heo <tj@kernel.org>
Cc: David Rientjes <rientjes@google.com>
Cc: Pekka Enberg <penberg@kernel.org>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Glauber Costa <glommer@parallels.com>
Cc: Michal Hocko <mhocko@suse.cz>
Cc: Luiz Capitulino <lcapitulino@redhat.com>
Cc: Greg Thelen <gthelen@google.com>
Cc: Leonid Moiseichuk <leonid.moiseichuk@nokia.com>
Cc: KOSAKI Motohiro <kosaki.motohiro@gmail.com>
Cc: Minchan Kim <minchan@kernel.org>
Cc: Bartlomiej Zolnierkiewicz <b.zolnierkie@samsung.com>
Cc: John Stultz <john.stultz@linaro.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
											
										 
											2013-04-29 15:08:31 -07:00
										 |  |  | #include <linux/vmpressure.h>
 | 
					
						
							| 
									
										
										
										
											2006-09-27 01:50:00 -07:00
										 |  |  | #include <linux/vmstat.h>
 | 
					
						
							| 
									
										
										
										
											2005-04-16 15:20:36 -07:00
										 |  |  | #include <linux/file.h>
 | 
					
						
							|  |  |  | #include <linux/writeback.h>
 | 
					
						
							|  |  |  | #include <linux/blkdev.h>
 | 
					
						
							|  |  |  | #include <linux/buffer_head.h>	/* for try_to_release_page(),
 | 
					
						
							|  |  |  | 					buffer_heads_over_limit */ | 
					
						
							|  |  |  | #include <linux/mm_inline.h>
 | 
					
						
							|  |  |  | #include <linux/backing-dev.h>
 | 
					
						
							|  |  |  | #include <linux/rmap.h>
 | 
					
						
							|  |  |  | #include <linux/topology.h>
 | 
					
						
							|  |  |  | #include <linux/cpu.h>
 | 
					
						
							|  |  |  | #include <linux/cpuset.h>
 | 
					
						
							| 
									
										
										
										
											2011-01-13 15:45:56 -08:00
										 |  |  | #include <linux/compaction.h>
 | 
					
						
							| 
									
										
										
										
											2005-04-16 15:20:36 -07:00
										 |  |  | #include <linux/notifier.h>
 | 
					
						
							|  |  |  | #include <linux/rwsem.h>
 | 
					
						
							| 
									
										
										
										
											2006-03-22 00:09:04 -08:00
										 |  |  | #include <linux/delay.h>
 | 
					
						
							| 
									
										
										
										
											2006-06-27 02:53:33 -07:00
										 |  |  | #include <linux/kthread.h>
 | 
					
						
							| 
									
										
										
										
											2006-12-06 20:34:23 -08:00
										 |  |  | #include <linux/freezer.h>
 | 
					
						
							| 
									
										
										
										
											2008-02-07 00:13:56 -08:00
										 |  |  | #include <linux/memcontrol.h>
 | 
					
						
							| 
									
										
										
										
											2008-07-25 01:48:52 -07:00
										 |  |  | #include <linux/delayacct.h>
 | 
					
						
							| 
									
										
										
										
											2008-10-18 20:26:53 -07:00
										 |  |  | #include <linux/sysctl.h>
 | 
					
						
							| 
									
										
											  
											
												vmscan: all_unreclaimable() use zone->all_unreclaimable as a name
all_unreclaimable check in direct reclaim has been introduced at 2.6.19
by following commit.
	2006 Sep 25; commit 408d8544; oom: use unreclaimable info
And it went through strange history. firstly, following commit broke
the logic unintentionally.
	2008 Apr 29; commit a41f24ea; page allocator: smarter retry of
				      costly-order allocations
Two years later, I've found obvious meaningless code fragment and
restored original intention by following commit.
	2010 Jun 04; commit bb21c7ce; vmscan: fix do_try_to_free_pages()
				      return value when priority==0
But, the logic didn't works when 32bit highmem system goes hibernation
and Minchan slightly changed the algorithm and fixed it .
	2010 Sep 22: commit d1908362: vmscan: check all_unreclaimable
				      in direct reclaim path
But, recently, Andrey Vagin found the new corner case. Look,
	struct zone {
	  ..
	        int                     all_unreclaimable;
	  ..
	        unsigned long           pages_scanned;
	  ..
	}
zone->all_unreclaimable and zone->pages_scanned are neigher atomic
variables nor protected by lock.  Therefore zones can become a state of
zone->page_scanned=0 and zone->all_unreclaimable=1.  In this case, current
all_unreclaimable() return false even though zone->all_unreclaimabe=1.
This resulted in the kernel hanging up when executing a loop of the form
1. fork
2. mmap
3. touch memory
4. read memory
5. munmmap
as described in
http://www.gossamer-threads.com/lists/linux/kernel/1348725#1348725
Is this ignorable minor issue?  No.  Unfortunately, x86 has very small dma
zone and it become zone->all_unreclamble=1 easily.  and if it become
all_unreclaimable=1, it never restore all_unreclaimable=0.  Why?  if
all_unreclaimable=1, vmscan only try DEF_PRIORITY reclaim and
a-few-lru-pages>>DEF_PRIORITY always makes 0.  that mean no page scan at
all!
Eventually, oom-killer never works on such systems.  That said, we can't
use zone->pages_scanned for this purpose.  This patch restore
all_unreclaimable() use zone->all_unreclaimable as old.  and in addition,
to add oom_killer_disabled check to avoid reintroduce the issue of commit
d1908362 ("vmscan: check all_unreclaimable in direct reclaim path").
Reported-by: Andrey Vagin <avagin@openvz.org>
Signed-off-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Cc: Nick Piggin <npiggin@kernel.dk>
Reviewed-by: Minchan Kim <minchan.kim@gmail.com>
Reviewed-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Acked-by: David Rientjes <rientjes@google.com>
Cc: <stable@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
											
										 
											2011-04-14 15:22:12 -07:00
										 |  |  | #include <linux/oom.h>
 | 
					
						
							| 
									
										
										
										
											2011-05-20 12:50:29 -07:00
										 |  |  | #include <linux/prefetch.h>
 | 
					
						
							| 
									
										
										
										
											2005-04-16 15:20:36 -07:00
										 |  |  | 
 | 
					
						
							|  |  |  | #include <asm/tlbflush.h>
 | 
					
						
							|  |  |  | #include <asm/div64.h>
 | 
					
						
							|  |  |  | 
 | 
					
						
							|  |  |  | #include <linux/swapops.h>
 | 
					
						
							| 
									
										
										
										
											2013-09-30 13:45:16 -07:00
										 |  |  | #include <linux/balloon_compaction.h>
 | 
					
						
							| 
									
										
										
										
											2005-04-16 15:20:36 -07:00
										 |  |  | 
 | 
					
						
							| 
									
										
										
										
											2006-03-22 00:08:33 -08:00
										 |  |  | #include "internal.h"
 | 
					
						
							|  |  |  | 
 | 
					
						
							| 
									
										
										
										
											2010-08-09 17:19:16 -07:00
										 |  |  | #define CREATE_TRACE_POINTS
 | 
					
						
							|  |  |  | #include <trace/events/vmscan.h>
 | 
					
						
							|  |  |  | 
 | 
					
						
							| 
									
										
										
										
											2005-04-16 15:20:36 -07:00
										 |  |  | struct scan_control { | 
					
						
							|  |  |  | 	/* Incremented by the number of inactive pages that were scanned */ | 
					
						
							|  |  |  | 	unsigned long nr_scanned; | 
					
						
							|  |  |  | 
 | 
					
						
							| 
									
										
											  
											
												vmscan: bail out of direct reclaim after swap_cluster_max pages
When the VM is under pressure, it can happen that several direct reclaim
processes are in the pageout code simultaneously.  It also happens that
the reclaiming processes run into mostly referenced, mapped and dirty
pages in the first round.
This results in multiple direct reclaim processes having a lower
pageout priority, which corresponds to a higher target of pages to
scan.
This in turn can result in each direct reclaim process freeing
many pages.  Together, they can end up freeing way too many pages.
This kicks useful data out of memory (in some cases more than half
of all memory is swapped out).  It also impacts performance by
keeping tasks stuck in the pageout code for too long.
A 30% improvement in hackbench has been observed with this patch.
The fix is relatively simple: in shrink_zone() we can check how many
pages we have already freed, direct reclaim tasks break out of the
scanning loop if they have already freed enough pages and have reached
a lower priority level.
We do not break out of shrink_zone() when priority == DEF_PRIORITY,
to ensure that equal pressure is applied to every zone in the common
case.
However, in order to do this we do need to know how many pages we already
freed, so move nr_reclaimed into scan_control.
akpm: a historical interlude...
We tried this in 2004:
:commit e468e46a9bea3297011d5918663ce6d19094cf87
:Author: akpm <akpm>
:Date:   Thu Jun 24 15:53:52 2004 +0000
:
:[PATCH] vmscan.c: dont reclaim too many pages
:
:    The shrink_zone() logic can, under some circumstances, cause far too many
:    pages to be reclaimed.  Say, we're scanning at high priority and suddenly hit
:    a large number of reclaimable pages on the LRU.
:    Change things so we bale out when SWAP_CLUSTER_MAX pages have been reclaimed.
And we reverted it in 2006:
:commit 210fe530305ee50cd889fe9250168228b2994f32
:Author: Andrew Morton <akpm@osdl.org>
:Date:   Fri Jan 6 00:11:14 2006 -0800
:
:    [PATCH] vmscan: balancing fix
:
:    Revert a patch which went into 2.6.8-rc1.  The changelog for that patch was:
:
:      The shrink_zone() logic can, under some circumstances, cause far too many
:      pages to be reclaimed.  Say, we're scanning at high priority and suddenly
:      hit a large number of reclaimable pages on the LRU.
:
:      Change things so we bale out when SWAP_CLUSTER_MAX pages have been
:      reclaimed.
:
:    Problem is, this change caused significant imbalance in inter-zone scan
:    balancing by truncating scans of larger zones.
:
:    Suppose, for example, ZONE_HIGHMEM is 10x the size of ZONE_NORMAL.  The zone
:    balancing algorithm would require that if we're scanning 100 pages of
:    ZONE_HIGHMEM, we should scan 10 pages of ZONE_NORMAL.  But this logic will
:    cause the scanning of ZONE_HIGHMEM to bale out after only 32 pages are
:    reclaimed.  Thus effectively causing smaller zones to be scanned relatively
:    harder than large ones.
:
:    Now I need to remember what the workload was which caused me to write this
:    patch originally, then fix it up in a different way...
And we haven't demonstrated that whatever problem caused that reversion is
not being reintroduced by this change in 2008.
Signed-off-by: Rik van Riel <riel@redhat.com>
Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
											
										 
											2009-01-06 14:40:01 -08:00
										 |  |  | 	/* Number of pages freed so far during a call to shrink_zones() */ | 
					
						
							|  |  |  | 	unsigned long nr_reclaimed; | 
					
						
							|  |  |  | 
 | 
					
						
							| 
									
										
										
										
											2009-12-14 17:59:10 -08:00
										 |  |  | 	/* How many pages shrink_list() should reclaim */ | 
					
						
							|  |  |  | 	unsigned long nr_to_reclaim; | 
					
						
							|  |  |  | 
 | 
					
						
							| 
									
										
											  
											
												vmscan: kill hibernation specific reclaim logic and unify it
shrink_all_zone() was introduced by commit d6277db4ab (swsusp: rework
memory shrinker) for hibernate performance improvement.  and
sc.swap_cluster_max was introduced by commit a06fe4d307 (Speed freeing
memory for suspend).
commit a06fe4d307 said
   Without the patch:
   Freed  14600 pages in  1749 jiffies = 32.61 MB/s (Anomolous!)
   Freed  88563 pages in 14719 jiffies = 23.50 MB/s
   Freed 205734 pages in 32389 jiffies = 24.81 MB/s
   With the patch:
   Freed  68252 pages in   496 jiffies = 537.52 MB/s
   Freed 116464 pages in   569 jiffies = 798.54 MB/s
   Freed 209699 pages in   705 jiffies = 1161.89 MB/s
At that time, their patch was pretty worth.  However, Modern Hardware
trend and recent VM improvement broke its worth.  From several reason, I
think we should remove shrink_all_zones() at all.
detail:
1) Old days, shrink_zone()'s slowness was mainly caused by stupid io-throttle
  at no i/o congestion.
  but current shrink_zone() is sane, not slow.
2) shrink_all_zone() try to shrink all pages at a time. but it doesn't works
  fine on numa system.
  example)
    System has 4GB memory and each node have 2GB. and hibernate need 1GB.
    optimal)
       steal 500MB from each node.
    shrink_all_zones)
       steal 1GB from node-0.
  Oh, Cache balancing logic was broken. ;)
  Unfortunately, Desktop system moved ahead NUMA at nowadays.
  (Side note, if hibernate require 2GB, shrink_all_zones() never success
   on above machine)
3) if the node has several I/O flighting pages, shrink_all_zones() makes
  pretty bad result.
  schenario) hibernate need 1GB
  1) shrink_all_zones() try to reclaim 1GB from Node-0
  2) but it only reclaimed 990MB
  3) stupidly, shrink_all_zones() try to reclaim 1GB from Node-1
  4) it reclaimed 990MB
  Oh, well. it reclaimed twice much than required.
  In the other hand, current shrink_zone() has sane baling out logic.
  then, it doesn't make overkill reclaim. then, we lost shrink_zones()'s risk.
4) SplitLRU VM always keep active/inactive ratio very carefully. inactive list only
  shrinking break its assumption. it makes unnecessary OOM risk. it obviously suboptimal.
Now, shrink_all_memory() is only the wrapper function of do_try_to_free_pages().
it bring good reviewability and debuggability, and solve above problems.
side note: Reclaim logic unificication makes two good side effect.
 - Fix recursive reclaim bug on shrink_all_memory().
   it did forgot to use PF_MEMALLOC. it mean the system be able to stuck into deadlock.
 - Now, shrink_all_memory() got lockdep awareness. it bring good debuggability.
Signed-off-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Reviewed-by: Rik van Riel <riel@redhat.com>
Acked-by: Rafael J. Wysocki <rjw@sisk.pl>
Cc: Mel Gorman <mel@csn.ul.ie>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
											
										 
											2009-12-14 17:59:12 -08:00
										 |  |  | 	unsigned long hibernation_mode; | 
					
						
							|  |  |  | 
 | 
					
						
							| 
									
										
										
										
											2005-04-16 15:20:36 -07:00
										 |  |  | 	/* This context's GFP mask */ | 
					
						
							| 
									
										
										
										
											2005-10-21 03:18:50 -04:00
										 |  |  | 	gfp_t gfp_mask; | 
					
						
							| 
									
										
										
										
											2005-04-16 15:20:36 -07:00
										 |  |  | 
 | 
					
						
							|  |  |  | 	int may_writepage; | 
					
						
							|  |  |  | 
 | 
					
						
							| 
									
										
										
										
											2009-03-31 15:19:30 -07:00
										 |  |  | 	/* Can mapped pages be reclaimed? */ | 
					
						
							|  |  |  | 	int may_unmap; | 
					
						
							| 
									
										
										
										
											2006-01-18 17:42:30 -08:00
										 |  |  | 
 | 
					
						
							| 
									
										
										
										
											2009-04-21 12:24:57 -07:00
										 |  |  | 	/* Can pages be swapped as part of reclaim? */ | 
					
						
							|  |  |  | 	int may_swap; | 
					
						
							|  |  |  | 
 | 
					
						
							| 
									
										
										
										
											2007-07-17 04:03:16 -07:00
										 |  |  | 	int order; | 
					
						
							| 
									
										
										
										
											2008-02-07 00:13:56 -08:00
										 |  |  | 
 | 
					
						
							| 
									
										
										
										
											2012-05-29 15:06:57 -07:00
										 |  |  | 	/* Scan (total_size >> priority) pages at once */ | 
					
						
							|  |  |  | 	int priority; | 
					
						
							|  |  |  | 
 | 
					
						
							| 
									
										
										
										
											2012-01-12 17:17:52 -08:00
										 |  |  | 	/*
 | 
					
						
							|  |  |  | 	 * The memory cgroup that hit its limit and as a result is the | 
					
						
							|  |  |  | 	 * primary target of this reclaim invocation. | 
					
						
							|  |  |  | 	 */ | 
					
						
							|  |  |  | 	struct mem_cgroup *target_mem_cgroup; | 
					
						
							| 
									
										
										
										
											2008-02-07 00:13:56 -08:00
										 |  |  | 
 | 
					
						
							| 
									
										
										
										
											2009-03-31 15:23:31 -07:00
										 |  |  | 	/*
 | 
					
						
							|  |  |  | 	 * Nodemask of nodes allowed by the caller. If NULL, all nodes | 
					
						
							|  |  |  | 	 * are scanned. | 
					
						
							|  |  |  | 	 */ | 
					
						
							|  |  |  | 	nodemask_t	*nodemask; | 
					
						
							| 
									
										
										
										
											2005-04-16 15:20:36 -07:00
										 |  |  | }; | 
					
						
							|  |  |  | 
 | 
					
						
							|  |  |  | #define lru_to_page(_head) (list_entry((_head)->prev, struct page, lru))
 | 
					
						
							|  |  |  | 
 | 
					
						
							|  |  |  | #ifdef ARCH_HAS_PREFETCH
 | 
					
						
							|  |  |  | #define prefetch_prev_lru_page(_page, _base, _field)			\
 | 
					
						
							|  |  |  | 	do {								\ | 
					
						
							|  |  |  | 		if ((_page)->lru.prev != _base) {			\ | 
					
						
							|  |  |  | 			struct page *prev;				\ | 
					
						
							|  |  |  | 									\ | 
					
						
							|  |  |  | 			prev = lru_to_page(&(_page->lru));		\ | 
					
						
							|  |  |  | 			prefetch(&prev->_field);			\ | 
					
						
							|  |  |  | 		}							\ | 
					
						
							|  |  |  | 	} while (0) | 
					
						
							|  |  |  | #else
 | 
					
						
							|  |  |  | #define prefetch_prev_lru_page(_page, _base, _field) do { } while (0)
 | 
					
						
							|  |  |  | #endif
 | 
					
						
							|  |  |  | 
 | 
					
						
							|  |  |  | #ifdef ARCH_HAS_PREFETCHW
 | 
					
						
							|  |  |  | #define prefetchw_prev_lru_page(_page, _base, _field)			\
 | 
					
						
							|  |  |  | 	do {								\ | 
					
						
							|  |  |  | 		if ((_page)->lru.prev != _base) {			\ | 
					
						
							|  |  |  | 			struct page *prev;				\ | 
					
						
							|  |  |  | 									\ | 
					
						
							|  |  |  | 			prev = lru_to_page(&(_page->lru));		\ | 
					
						
							|  |  |  | 			prefetchw(&prev->_field);			\ | 
					
						
							|  |  |  | 		}							\ | 
					
						
							|  |  |  | 	} while (0) | 
					
						
							|  |  |  | #else
 | 
					
						
							|  |  |  | #define prefetchw_prev_lru_page(_page, _base, _field) do { } while (0)
 | 
					
						
							|  |  |  | #endif
 | 
					
						
							|  |  |  | 
 | 
					
						
							|  |  |  | /*
 | 
					
						
							|  |  |  |  * From 0 .. 100.  Higher means more swappy. | 
					
						
							|  |  |  |  */ | 
					
						
							|  |  |  | int vm_swappiness = 60; | 
					
						
							| 
									
										
										
										
											2013-02-22 16:35:48 -08:00
										 |  |  | unsigned long vm_total_pages;	/* The total number of pages which the VM controls */ | 
					
						
							| 
									
										
										
										
											2005-04-16 15:20:36 -07:00
										 |  |  | 
 | 
					
						
							|  |  |  | static LIST_HEAD(shrinker_list); | 
					
						
							|  |  |  | static DECLARE_RWSEM(shrinker_rwsem); | 
					
						
							|  |  |  | 
 | 
					
						
							| 
									
										
										
										
											2012-07-31 16:43:02 -07:00
										 |  |  | #ifdef CONFIG_MEMCG
 | 
					
						
							| 
									
										
										
										
											2012-01-12 17:17:50 -08:00
										 |  |  | static bool global_reclaim(struct scan_control *sc) | 
					
						
							|  |  |  | { | 
					
						
							| 
									
										
										
										
											2012-01-12 17:17:52 -08:00
										 |  |  | 	return !sc->target_mem_cgroup; | 
					
						
							| 
									
										
										
										
											2012-01-12 17:17:50 -08:00
										 |  |  | } | 
					
						
							| 
									
										
										
										
											2008-02-07 00:14:29 -08:00
										 |  |  | #else
 | 
					
						
							| 
									
										
										
										
											2012-01-12 17:17:50 -08:00
										 |  |  | static bool global_reclaim(struct scan_control *sc) | 
					
						
							|  |  |  | { | 
					
						
							|  |  |  | 	return true; | 
					
						
							|  |  |  | } | 
					
						
							| 
									
										
										
										
											2008-02-07 00:14:29 -08:00
										 |  |  | #endif
 | 
					
						
							|  |  |  | 
 | 
					
						
							| 
									
										
										
										
											2013-09-11 14:22:36 -07:00
										 |  |  | unsigned long zone_reclaimable_pages(struct zone *zone) | 
					
						
							|  |  |  | { | 
					
						
							|  |  |  | 	int nr; | 
					
						
							|  |  |  | 
 | 
					
						
							|  |  |  | 	nr = zone_page_state(zone, NR_ACTIVE_FILE) + | 
					
						
							|  |  |  | 	     zone_page_state(zone, NR_INACTIVE_FILE); | 
					
						
							|  |  |  | 
 | 
					
						
							|  |  |  | 	if (get_nr_swap_pages() > 0) | 
					
						
							|  |  |  | 		nr += zone_page_state(zone, NR_ACTIVE_ANON) + | 
					
						
							|  |  |  | 		      zone_page_state(zone, NR_INACTIVE_ANON); | 
					
						
							|  |  |  | 
 | 
					
						
							|  |  |  | 	return nr; | 
					
						
							|  |  |  | } | 
					
						
							|  |  |  | 
 | 
					
						
							|  |  |  | bool zone_reclaimable(struct zone *zone) | 
					
						
							|  |  |  | { | 
					
						
							|  |  |  | 	return zone->pages_scanned < zone_reclaimable_pages(zone) * 6; | 
					
						
							|  |  |  | } | 
					
						
							|  |  |  | 
 | 
					
						
							| 
									
										
										
										
											2012-05-29 15:07:08 -07:00
										 |  |  | static unsigned long get_lru_size(struct lruvec *lruvec, enum lru_list lru) | 
					
						
							| 
									
										
										
										
											2009-01-07 18:08:16 -08:00
										 |  |  | { | 
					
						
							| 
									
										
										
										
											2012-05-29 15:06:52 -07:00
										 |  |  | 	if (!mem_cgroup_disabled()) | 
					
						
							| 
									
										
										
										
											2012-05-29 15:07:08 -07:00
										 |  |  | 		return mem_cgroup_get_lru_size(lruvec, lru); | 
					
						
							| 
									
										
										
										
											2009-01-07 18:08:19 -08:00
										 |  |  | 
 | 
					
						
							| 
									
										
										
										
											2012-05-29 15:07:00 -07:00
										 |  |  | 	return zone_page_state(lruvec_zone(lruvec), NR_LRU_BASE + lru); | 
					
						
							| 
									
										
										
										
											2009-01-07 18:08:16 -08:00
										 |  |  | } | 
					
						
							|  |  |  | 
 | 
					
						
							| 
									
										
										
										
											2005-04-16 15:20:36 -07:00
										 |  |  | /*
 | 
					
						
							| 
									
										
											  
											
												vmscan: per-node deferred work
The list_lru infrastructure already keeps per-node LRU lists in its
node-specific list_lru_node arrays and provide us with a per-node API, and
the shrinkers are properly equiped with node information.  This means that
we can now focus our shrinking effort in a single node, but the work that
is deferred from one run to another is kept global at nr_in_batch.  Work
can be deferred, for instance, during direct reclaim under a GFP_NOFS
allocation, where situation, all the filesystem shrinkers will be
prevented from running and accumulate in nr_in_batch the amount of work
they should have done, but could not.
This creates an impedance problem, where upon node pressure, work deferred
will accumulate and end up being flushed in other nodes.  The problem we
describe is particularly harmful in big machines, where many nodes can
accumulate at the same time, all adding to the global counter nr_in_batch.
 As we accumulate more and more, we start to ask for the caches to flush
even bigger numbers.  The result is that the caches are depleted and do
not stabilize.  To achieve stable steady state behavior, we need to tackle
it differently.
In this patch we keep the deferred count per-node, in the new array
nr_deferred[] (the name is also a bit more descriptive) and will never
accumulate that to other nodes.
Signed-off-by: Glauber Costa <glommer@openvz.org>
Cc: Dave Chinner <dchinner@redhat.com>
Cc: Mel Gorman <mgorman@suse.de>
Cc: "Theodore Ts'o" <tytso@mit.edu>
Cc: Adrian Hunter <adrian.hunter@intel.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Artem Bityutskiy <artem.bityutskiy@linux.intel.com>
Cc: Arve Hjønnevåg <arve@android.com>
Cc: Carlos Maiolino <cmaiolino@redhat.com>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Chuck Lever <chuck.lever@oracle.com>
Cc: Daniel Vetter <daniel.vetter@ffwll.ch>
Cc: David Rientjes <rientjes@google.com>
Cc: Gleb Natapov <gleb@redhat.com>
Cc: Greg Thelen <gthelen@google.com>
Cc: J. Bruce Fields <bfields@redhat.com>
Cc: Jan Kara <jack@suse.cz>
Cc: Jerome Glisse <jglisse@redhat.com>
Cc: John Stultz <john.stultz@linaro.org>
Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Kent Overstreet <koverstreet@google.com>
Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Cc: Marcelo Tosatti <mtosatti@redhat.com>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Steven Whitehouse <swhiteho@redhat.com>
Cc: Thomas Hellstrom <thellstrom@vmware.com>
Cc: Trond Myklebust <Trond.Myklebust@netapp.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
											
										 
											2013-08-28 10:18:04 +10:00
										 |  |  |  * Add a shrinker callback to be called from the vm. | 
					
						
							| 
									
										
										
										
											2005-04-16 15:20:36 -07:00
										 |  |  |  */ | 
					
						
							| 
									
										
											  
											
												vmscan: per-node deferred work
The list_lru infrastructure already keeps per-node LRU lists in its
node-specific list_lru_node arrays and provide us with a per-node API, and
the shrinkers are properly equiped with node information.  This means that
we can now focus our shrinking effort in a single node, but the work that
is deferred from one run to another is kept global at nr_in_batch.  Work
can be deferred, for instance, during direct reclaim under a GFP_NOFS
allocation, where situation, all the filesystem shrinkers will be
prevented from running and accumulate in nr_in_batch the amount of work
they should have done, but could not.
This creates an impedance problem, where upon node pressure, work deferred
will accumulate and end up being flushed in other nodes.  The problem we
describe is particularly harmful in big machines, where many nodes can
accumulate at the same time, all adding to the global counter nr_in_batch.
 As we accumulate more and more, we start to ask for the caches to flush
even bigger numbers.  The result is that the caches are depleted and do
not stabilize.  To achieve stable steady state behavior, we need to tackle
it differently.
In this patch we keep the deferred count per-node, in the new array
nr_deferred[] (the name is also a bit more descriptive) and will never
accumulate that to other nodes.
Signed-off-by: Glauber Costa <glommer@openvz.org>
Cc: Dave Chinner <dchinner@redhat.com>
Cc: Mel Gorman <mgorman@suse.de>
Cc: "Theodore Ts'o" <tytso@mit.edu>
Cc: Adrian Hunter <adrian.hunter@intel.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Artem Bityutskiy <artem.bityutskiy@linux.intel.com>
Cc: Arve Hjønnevåg <arve@android.com>
Cc: Carlos Maiolino <cmaiolino@redhat.com>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Chuck Lever <chuck.lever@oracle.com>
Cc: Daniel Vetter <daniel.vetter@ffwll.ch>
Cc: David Rientjes <rientjes@google.com>
Cc: Gleb Natapov <gleb@redhat.com>
Cc: Greg Thelen <gthelen@google.com>
Cc: J. Bruce Fields <bfields@redhat.com>
Cc: Jan Kara <jack@suse.cz>
Cc: Jerome Glisse <jglisse@redhat.com>
Cc: John Stultz <john.stultz@linaro.org>
Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Kent Overstreet <koverstreet@google.com>
Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Cc: Marcelo Tosatti <mtosatti@redhat.com>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Steven Whitehouse <swhiteho@redhat.com>
Cc: Thomas Hellstrom <thellstrom@vmware.com>
Cc: Trond Myklebust <Trond.Myklebust@netapp.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
											
										 
											2013-08-28 10:18:04 +10:00
										 |  |  | int register_shrinker(struct shrinker *shrinker) | 
					
						
							| 
									
										
										
										
											2005-04-16 15:20:36 -07:00
										 |  |  | { | 
					
						
							| 
									
										
											  
											
												vmscan: per-node deferred work
The list_lru infrastructure already keeps per-node LRU lists in its
node-specific list_lru_node arrays and provide us with a per-node API, and
the shrinkers are properly equiped with node information.  This means that
we can now focus our shrinking effort in a single node, but the work that
is deferred from one run to another is kept global at nr_in_batch.  Work
can be deferred, for instance, during direct reclaim under a GFP_NOFS
allocation, where situation, all the filesystem shrinkers will be
prevented from running and accumulate in nr_in_batch the amount of work
they should have done, but could not.
This creates an impedance problem, where upon node pressure, work deferred
will accumulate and end up being flushed in other nodes.  The problem we
describe is particularly harmful in big machines, where many nodes can
accumulate at the same time, all adding to the global counter nr_in_batch.
 As we accumulate more and more, we start to ask for the caches to flush
even bigger numbers.  The result is that the caches are depleted and do
not stabilize.  To achieve stable steady state behavior, we need to tackle
it differently.
In this patch we keep the deferred count per-node, in the new array
nr_deferred[] (the name is also a bit more descriptive) and will never
accumulate that to other nodes.
Signed-off-by: Glauber Costa <glommer@openvz.org>
Cc: Dave Chinner <dchinner@redhat.com>
Cc: Mel Gorman <mgorman@suse.de>
Cc: "Theodore Ts'o" <tytso@mit.edu>
Cc: Adrian Hunter <adrian.hunter@intel.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Artem Bityutskiy <artem.bityutskiy@linux.intel.com>
Cc: Arve Hjønnevåg <arve@android.com>
Cc: Carlos Maiolino <cmaiolino@redhat.com>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Chuck Lever <chuck.lever@oracle.com>
Cc: Daniel Vetter <daniel.vetter@ffwll.ch>
Cc: David Rientjes <rientjes@google.com>
Cc: Gleb Natapov <gleb@redhat.com>
Cc: Greg Thelen <gthelen@google.com>
Cc: J. Bruce Fields <bfields@redhat.com>
Cc: Jan Kara <jack@suse.cz>
Cc: Jerome Glisse <jglisse@redhat.com>
Cc: John Stultz <john.stultz@linaro.org>
Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Kent Overstreet <koverstreet@google.com>
Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Cc: Marcelo Tosatti <mtosatti@redhat.com>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Steven Whitehouse <swhiteho@redhat.com>
Cc: Thomas Hellstrom <thellstrom@vmware.com>
Cc: Trond Myklebust <Trond.Myklebust@netapp.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
											
										 
											2013-08-28 10:18:04 +10:00
										 |  |  | 	size_t size = sizeof(*shrinker->nr_deferred); | 
					
						
							|  |  |  | 
 | 
					
						
							|  |  |  | 	/*
 | 
					
						
							|  |  |  | 	 * If we only have one possible node in the system anyway, save | 
					
						
							|  |  |  | 	 * ourselves the trouble and disable NUMA aware behavior. This way we | 
					
						
							|  |  |  | 	 * will save memory and some small loop time later. | 
					
						
							|  |  |  | 	 */ | 
					
						
							|  |  |  | 	if (nr_node_ids == 1) | 
					
						
							|  |  |  | 		shrinker->flags &= ~SHRINKER_NUMA_AWARE; | 
					
						
							|  |  |  | 
 | 
					
						
							|  |  |  | 	if (shrinker->flags & SHRINKER_NUMA_AWARE) | 
					
						
							|  |  |  | 		size *= nr_node_ids; | 
					
						
							|  |  |  | 
 | 
					
						
							|  |  |  | 	shrinker->nr_deferred = kzalloc(size, GFP_KERNEL); | 
					
						
							|  |  |  | 	if (!shrinker->nr_deferred) | 
					
						
							|  |  |  | 		return -ENOMEM; | 
					
						
							|  |  |  | 
 | 
					
						
							| 
									
										
										
										
											2007-07-17 04:03:17 -07:00
										 |  |  | 	down_write(&shrinker_rwsem); | 
					
						
							|  |  |  | 	list_add_tail(&shrinker->list, &shrinker_list); | 
					
						
							|  |  |  | 	up_write(&shrinker_rwsem); | 
					
						
							| 
									
										
											  
											
												vmscan: per-node deferred work
The list_lru infrastructure already keeps per-node LRU lists in its
node-specific list_lru_node arrays and provide us with a per-node API, and
the shrinkers are properly equiped with node information.  This means that
we can now focus our shrinking effort in a single node, but the work that
is deferred from one run to another is kept global at nr_in_batch.  Work
can be deferred, for instance, during direct reclaim under a GFP_NOFS
allocation, where situation, all the filesystem shrinkers will be
prevented from running and accumulate in nr_in_batch the amount of work
they should have done, but could not.
This creates an impedance problem, where upon node pressure, work deferred
will accumulate and end up being flushed in other nodes.  The problem we
describe is particularly harmful in big machines, where many nodes can
accumulate at the same time, all adding to the global counter nr_in_batch.
 As we accumulate more and more, we start to ask for the caches to flush
even bigger numbers.  The result is that the caches are depleted and do
not stabilize.  To achieve stable steady state behavior, we need to tackle
it differently.
In this patch we keep the deferred count per-node, in the new array
nr_deferred[] (the name is also a bit more descriptive) and will never
accumulate that to other nodes.
Signed-off-by: Glauber Costa <glommer@openvz.org>
Cc: Dave Chinner <dchinner@redhat.com>
Cc: Mel Gorman <mgorman@suse.de>
Cc: "Theodore Ts'o" <tytso@mit.edu>
Cc: Adrian Hunter <adrian.hunter@intel.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Artem Bityutskiy <artem.bityutskiy@linux.intel.com>
Cc: Arve Hjønnevåg <arve@android.com>
Cc: Carlos Maiolino <cmaiolino@redhat.com>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Chuck Lever <chuck.lever@oracle.com>
Cc: Daniel Vetter <daniel.vetter@ffwll.ch>
Cc: David Rientjes <rientjes@google.com>
Cc: Gleb Natapov <gleb@redhat.com>
Cc: Greg Thelen <gthelen@google.com>
Cc: J. Bruce Fields <bfields@redhat.com>
Cc: Jan Kara <jack@suse.cz>
Cc: Jerome Glisse <jglisse@redhat.com>
Cc: John Stultz <john.stultz@linaro.org>
Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Kent Overstreet <koverstreet@google.com>
Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Cc: Marcelo Tosatti <mtosatti@redhat.com>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Steven Whitehouse <swhiteho@redhat.com>
Cc: Thomas Hellstrom <thellstrom@vmware.com>
Cc: Trond Myklebust <Trond.Myklebust@netapp.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
											
										 
											2013-08-28 10:18:04 +10:00
										 |  |  | 	return 0; | 
					
						
							| 
									
										
										
										
											2005-04-16 15:20:36 -07:00
										 |  |  | } | 
					
						
							| 
									
										
										
										
											2007-07-17 04:03:17 -07:00
										 |  |  | EXPORT_SYMBOL(register_shrinker); | 
					
						
							| 
									
										
										
										
											2005-04-16 15:20:36 -07:00
										 |  |  | 
 | 
					
						
							|  |  |  | /*
 | 
					
						
							|  |  |  |  * Remove one | 
					
						
							|  |  |  |  */ | 
					
						
							| 
									
										
										
										
											2007-07-17 04:03:17 -07:00
										 |  |  | void unregister_shrinker(struct shrinker *shrinker) | 
					
						
							| 
									
										
										
										
											2005-04-16 15:20:36 -07:00
										 |  |  | { | 
					
						
							|  |  |  | 	down_write(&shrinker_rwsem); | 
					
						
							|  |  |  | 	list_del(&shrinker->list); | 
					
						
							|  |  |  | 	up_write(&shrinker_rwsem); | 
					
						
							| 
									
										
										
										
											2013-10-16 13:46:46 -07:00
										 |  |  | 	kfree(shrinker->nr_deferred); | 
					
						
							| 
									
										
										
										
											2005-04-16 15:20:36 -07:00
										 |  |  | } | 
					
						
							| 
									
										
										
										
											2007-07-17 04:03:17 -07:00
										 |  |  | EXPORT_SYMBOL(unregister_shrinker); | 
					
						
							| 
									
										
										
										
											2005-04-16 15:20:36 -07:00
										 |  |  | 
 | 
					
						
							|  |  |  | #define SHRINK_BATCH 128
 | 
					
						
							| 
									
										
											  
											
												vmscan: per-node deferred work
The list_lru infrastructure already keeps per-node LRU lists in its
node-specific list_lru_node arrays and provide us with a per-node API, and
the shrinkers are properly equiped with node information.  This means that
we can now focus our shrinking effort in a single node, but the work that
is deferred from one run to another is kept global at nr_in_batch.  Work
can be deferred, for instance, during direct reclaim under a GFP_NOFS
allocation, where situation, all the filesystem shrinkers will be
prevented from running and accumulate in nr_in_batch the amount of work
they should have done, but could not.
This creates an impedance problem, where upon node pressure, work deferred
will accumulate and end up being flushed in other nodes.  The problem we
describe is particularly harmful in big machines, where many nodes can
accumulate at the same time, all adding to the global counter nr_in_batch.
 As we accumulate more and more, we start to ask for the caches to flush
even bigger numbers.  The result is that the caches are depleted and do
not stabilize.  To achieve stable steady state behavior, we need to tackle
it differently.
In this patch we keep the deferred count per-node, in the new array
nr_deferred[] (the name is also a bit more descriptive) and will never
accumulate that to other nodes.
Signed-off-by: Glauber Costa <glommer@openvz.org>
Cc: Dave Chinner <dchinner@redhat.com>
Cc: Mel Gorman <mgorman@suse.de>
Cc: "Theodore Ts'o" <tytso@mit.edu>
Cc: Adrian Hunter <adrian.hunter@intel.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Artem Bityutskiy <artem.bityutskiy@linux.intel.com>
Cc: Arve Hjønnevåg <arve@android.com>
Cc: Carlos Maiolino <cmaiolino@redhat.com>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Chuck Lever <chuck.lever@oracle.com>
Cc: Daniel Vetter <daniel.vetter@ffwll.ch>
Cc: David Rientjes <rientjes@google.com>
Cc: Gleb Natapov <gleb@redhat.com>
Cc: Greg Thelen <gthelen@google.com>
Cc: J. Bruce Fields <bfields@redhat.com>
Cc: Jan Kara <jack@suse.cz>
Cc: Jerome Glisse <jglisse@redhat.com>
Cc: John Stultz <john.stultz@linaro.org>
Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Kent Overstreet <koverstreet@google.com>
Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Cc: Marcelo Tosatti <mtosatti@redhat.com>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Steven Whitehouse <swhiteho@redhat.com>
Cc: Thomas Hellstrom <thellstrom@vmware.com>
Cc: Trond Myklebust <Trond.Myklebust@netapp.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
											
										 
											2013-08-28 10:18:04 +10:00
										 |  |  | 
 | 
					
						
							|  |  |  | static unsigned long | 
					
						
							|  |  |  | shrink_slab_node(struct shrink_control *shrinkctl, struct shrinker *shrinker, | 
					
						
							|  |  |  | 		 unsigned long nr_pages_scanned, unsigned long lru_pages) | 
					
						
							|  |  |  | { | 
					
						
							|  |  |  | 	unsigned long freed = 0; | 
					
						
							|  |  |  | 	unsigned long long delta; | 
					
						
							|  |  |  | 	long total_scan; | 
					
						
							|  |  |  | 	long max_pass; | 
					
						
							|  |  |  | 	long nr; | 
					
						
							|  |  |  | 	long new_nr; | 
					
						
							|  |  |  | 	int nid = shrinkctl->nid; | 
					
						
							|  |  |  | 	long batch_size = shrinker->batch ? shrinker->batch | 
					
						
							|  |  |  | 					  : SHRINK_BATCH; | 
					
						
							|  |  |  | 
 | 
					
						
							| 
									
										
										
										
											2013-08-28 10:18:16 +10:00
										 |  |  | 	max_pass = shrinker->count_objects(shrinker, shrinkctl); | 
					
						
							| 
									
										
											  
											
												vmscan: per-node deferred work
The list_lru infrastructure already keeps per-node LRU lists in its
node-specific list_lru_node arrays and provide us with a per-node API, and
the shrinkers are properly equiped with node information.  This means that
we can now focus our shrinking effort in a single node, but the work that
is deferred from one run to another is kept global at nr_in_batch.  Work
can be deferred, for instance, during direct reclaim under a GFP_NOFS
allocation, where situation, all the filesystem shrinkers will be
prevented from running and accumulate in nr_in_batch the amount of work
they should have done, but could not.
This creates an impedance problem, where upon node pressure, work deferred
will accumulate and end up being flushed in other nodes.  The problem we
describe is particularly harmful in big machines, where many nodes can
accumulate at the same time, all adding to the global counter nr_in_batch.
 As we accumulate more and more, we start to ask for the caches to flush
even bigger numbers.  The result is that the caches are depleted and do
not stabilize.  To achieve stable steady state behavior, we need to tackle
it differently.
In this patch we keep the deferred count per-node, in the new array
nr_deferred[] (the name is also a bit more descriptive) and will never
accumulate that to other nodes.
Signed-off-by: Glauber Costa <glommer@openvz.org>
Cc: Dave Chinner <dchinner@redhat.com>
Cc: Mel Gorman <mgorman@suse.de>
Cc: "Theodore Ts'o" <tytso@mit.edu>
Cc: Adrian Hunter <adrian.hunter@intel.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Artem Bityutskiy <artem.bityutskiy@linux.intel.com>
Cc: Arve Hjønnevåg <arve@android.com>
Cc: Carlos Maiolino <cmaiolino@redhat.com>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Chuck Lever <chuck.lever@oracle.com>
Cc: Daniel Vetter <daniel.vetter@ffwll.ch>
Cc: David Rientjes <rientjes@google.com>
Cc: Gleb Natapov <gleb@redhat.com>
Cc: Greg Thelen <gthelen@google.com>
Cc: J. Bruce Fields <bfields@redhat.com>
Cc: Jan Kara <jack@suse.cz>
Cc: Jerome Glisse <jglisse@redhat.com>
Cc: John Stultz <john.stultz@linaro.org>
Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Kent Overstreet <koverstreet@google.com>
Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Cc: Marcelo Tosatti <mtosatti@redhat.com>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Steven Whitehouse <swhiteho@redhat.com>
Cc: Thomas Hellstrom <thellstrom@vmware.com>
Cc: Trond Myklebust <Trond.Myklebust@netapp.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
											
										 
											2013-08-28 10:18:04 +10:00
										 |  |  | 	if (max_pass == 0) | 
					
						
							|  |  |  | 		return 0; | 
					
						
							|  |  |  | 
 | 
					
						
							|  |  |  | 	/*
 | 
					
						
							|  |  |  | 	 * copy the current shrinker scan count into a local variable | 
					
						
							|  |  |  | 	 * and zero it so that other concurrent shrinker invocations | 
					
						
							|  |  |  | 	 * don't also do this scanning work. | 
					
						
							|  |  |  | 	 */ | 
					
						
							|  |  |  | 	nr = atomic_long_xchg(&shrinker->nr_deferred[nid], 0); | 
					
						
							|  |  |  | 
 | 
					
						
							|  |  |  | 	total_scan = nr; | 
					
						
							|  |  |  | 	delta = (4 * nr_pages_scanned) / shrinker->seeks; | 
					
						
							|  |  |  | 	delta *= max_pass; | 
					
						
							|  |  |  | 	do_div(delta, lru_pages + 1); | 
					
						
							|  |  |  | 	total_scan += delta; | 
					
						
							|  |  |  | 	if (total_scan < 0) { | 
					
						
							|  |  |  | 		printk(KERN_ERR | 
					
						
							|  |  |  | 		"shrink_slab: %pF negative objects to delete nr=%ld\n", | 
					
						
							| 
									
										
										
										
											2013-08-28 10:18:16 +10:00
										 |  |  | 		       shrinker->scan_objects, total_scan); | 
					
						
							| 
									
										
											  
											
												vmscan: per-node deferred work
The list_lru infrastructure already keeps per-node LRU lists in its
node-specific list_lru_node arrays and provide us with a per-node API, and
the shrinkers are properly equiped with node information.  This means that
we can now focus our shrinking effort in a single node, but the work that
is deferred from one run to another is kept global at nr_in_batch.  Work
can be deferred, for instance, during direct reclaim under a GFP_NOFS
allocation, where situation, all the filesystem shrinkers will be
prevented from running and accumulate in nr_in_batch the amount of work
they should have done, but could not.
This creates an impedance problem, where upon node pressure, work deferred
will accumulate and end up being flushed in other nodes.  The problem we
describe is particularly harmful in big machines, where many nodes can
accumulate at the same time, all adding to the global counter nr_in_batch.
 As we accumulate more and more, we start to ask for the caches to flush
even bigger numbers.  The result is that the caches are depleted and do
not stabilize.  To achieve stable steady state behavior, we need to tackle
it differently.
In this patch we keep the deferred count per-node, in the new array
nr_deferred[] (the name is also a bit more descriptive) and will never
accumulate that to other nodes.
Signed-off-by: Glauber Costa <glommer@openvz.org>
Cc: Dave Chinner <dchinner@redhat.com>
Cc: Mel Gorman <mgorman@suse.de>
Cc: "Theodore Ts'o" <tytso@mit.edu>
Cc: Adrian Hunter <adrian.hunter@intel.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Artem Bityutskiy <artem.bityutskiy@linux.intel.com>
Cc: Arve Hjønnevåg <arve@android.com>
Cc: Carlos Maiolino <cmaiolino@redhat.com>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Chuck Lever <chuck.lever@oracle.com>
Cc: Daniel Vetter <daniel.vetter@ffwll.ch>
Cc: David Rientjes <rientjes@google.com>
Cc: Gleb Natapov <gleb@redhat.com>
Cc: Greg Thelen <gthelen@google.com>
Cc: J. Bruce Fields <bfields@redhat.com>
Cc: Jan Kara <jack@suse.cz>
Cc: Jerome Glisse <jglisse@redhat.com>
Cc: John Stultz <john.stultz@linaro.org>
Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Kent Overstreet <koverstreet@google.com>
Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Cc: Marcelo Tosatti <mtosatti@redhat.com>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Steven Whitehouse <swhiteho@redhat.com>
Cc: Thomas Hellstrom <thellstrom@vmware.com>
Cc: Trond Myklebust <Trond.Myklebust@netapp.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
											
										 
											2013-08-28 10:18:04 +10:00
										 |  |  | 		total_scan = max_pass; | 
					
						
							|  |  |  | 	} | 
					
						
							|  |  |  | 
 | 
					
						
							|  |  |  | 	/*
 | 
					
						
							|  |  |  | 	 * We need to avoid excessive windup on filesystem shrinkers | 
					
						
							|  |  |  | 	 * due to large numbers of GFP_NOFS allocations causing the | 
					
						
							|  |  |  | 	 * shrinkers to return -1 all the time. This results in a large | 
					
						
							|  |  |  | 	 * nr being built up so when a shrink that can do some work | 
					
						
							|  |  |  | 	 * comes along it empties the entire cache due to nr >>> | 
					
						
							|  |  |  | 	 * max_pass.  This is bad for sustaining a working set in | 
					
						
							|  |  |  | 	 * memory. | 
					
						
							|  |  |  | 	 * | 
					
						
							|  |  |  | 	 * Hence only allow the shrinker to scan the entire cache when | 
					
						
							|  |  |  | 	 * a large delta change is calculated directly. | 
					
						
							|  |  |  | 	 */ | 
					
						
							|  |  |  | 	if (delta < max_pass / 4) | 
					
						
							|  |  |  | 		total_scan = min(total_scan, max_pass / 2); | 
					
						
							|  |  |  | 
 | 
					
						
							|  |  |  | 	/*
 | 
					
						
							|  |  |  | 	 * Avoid risking looping forever due to too large nr value: | 
					
						
							|  |  |  | 	 * never try to free more than twice the estimate number of | 
					
						
							|  |  |  | 	 * freeable entries. | 
					
						
							|  |  |  | 	 */ | 
					
						
							|  |  |  | 	if (total_scan > max_pass * 2) | 
					
						
							|  |  |  | 		total_scan = max_pass * 2; | 
					
						
							|  |  |  | 
 | 
					
						
							|  |  |  | 	trace_mm_shrink_slab_start(shrinker, shrinkctl, nr, | 
					
						
							|  |  |  | 				nr_pages_scanned, lru_pages, | 
					
						
							|  |  |  | 				max_pass, delta, total_scan); | 
					
						
							|  |  |  | 
 | 
					
						
							|  |  |  | 	while (total_scan >= batch_size) { | 
					
						
							| 
									
										
										
										
											2013-08-28 10:18:16 +10:00
										 |  |  | 		unsigned long ret; | 
					
						
							| 
									
										
											  
											
												vmscan: per-node deferred work
The list_lru infrastructure already keeps per-node LRU lists in its
node-specific list_lru_node arrays and provide us with a per-node API, and
the shrinkers are properly equiped with node information.  This means that
we can now focus our shrinking effort in a single node, but the work that
is deferred from one run to another is kept global at nr_in_batch.  Work
can be deferred, for instance, during direct reclaim under a GFP_NOFS
allocation, where situation, all the filesystem shrinkers will be
prevented from running and accumulate in nr_in_batch the amount of work
they should have done, but could not.
This creates an impedance problem, where upon node pressure, work deferred
will accumulate and end up being flushed in other nodes.  The problem we
describe is particularly harmful in big machines, where many nodes can
accumulate at the same time, all adding to the global counter nr_in_batch.
 As we accumulate more and more, we start to ask for the caches to flush
even bigger numbers.  The result is that the caches are depleted and do
not stabilize.  To achieve stable steady state behavior, we need to tackle
it differently.
In this patch we keep the deferred count per-node, in the new array
nr_deferred[] (the name is also a bit more descriptive) and will never
accumulate that to other nodes.
Signed-off-by: Glauber Costa <glommer@openvz.org>
Cc: Dave Chinner <dchinner@redhat.com>
Cc: Mel Gorman <mgorman@suse.de>
Cc: "Theodore Ts'o" <tytso@mit.edu>
Cc: Adrian Hunter <adrian.hunter@intel.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Artem Bityutskiy <artem.bityutskiy@linux.intel.com>
Cc: Arve Hjønnevåg <arve@android.com>
Cc: Carlos Maiolino <cmaiolino@redhat.com>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Chuck Lever <chuck.lever@oracle.com>
Cc: Daniel Vetter <daniel.vetter@ffwll.ch>
Cc: David Rientjes <rientjes@google.com>
Cc: Gleb Natapov <gleb@redhat.com>
Cc: Greg Thelen <gthelen@google.com>
Cc: J. Bruce Fields <bfields@redhat.com>
Cc: Jan Kara <jack@suse.cz>
Cc: Jerome Glisse <jglisse@redhat.com>
Cc: John Stultz <john.stultz@linaro.org>
Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Kent Overstreet <koverstreet@google.com>
Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Cc: Marcelo Tosatti <mtosatti@redhat.com>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Steven Whitehouse <swhiteho@redhat.com>
Cc: Thomas Hellstrom <thellstrom@vmware.com>
Cc: Trond Myklebust <Trond.Myklebust@netapp.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
											
										 
											2013-08-28 10:18:04 +10:00
										 |  |  | 
 | 
					
						
							| 
									
										
										
										
											2013-08-28 10:18:16 +10:00
										 |  |  | 		shrinkctl->nr_to_scan = batch_size; | 
					
						
							|  |  |  | 		ret = shrinker->scan_objects(shrinker, shrinkctl); | 
					
						
							|  |  |  | 		if (ret == SHRINK_STOP) | 
					
						
							|  |  |  | 			break; | 
					
						
							|  |  |  | 		freed += ret; | 
					
						
							| 
									
										
											  
											
												vmscan: per-node deferred work
The list_lru infrastructure already keeps per-node LRU lists in its
node-specific list_lru_node arrays and provide us with a per-node API, and
the shrinkers are properly equiped with node information.  This means that
we can now focus our shrinking effort in a single node, but the work that
is deferred from one run to another is kept global at nr_in_batch.  Work
can be deferred, for instance, during direct reclaim under a GFP_NOFS
allocation, where situation, all the filesystem shrinkers will be
prevented from running and accumulate in nr_in_batch the amount of work
they should have done, but could not.
This creates an impedance problem, where upon node pressure, work deferred
will accumulate and end up being flushed in other nodes.  The problem we
describe is particularly harmful in big machines, where many nodes can
accumulate at the same time, all adding to the global counter nr_in_batch.
 As we accumulate more and more, we start to ask for the caches to flush
even bigger numbers.  The result is that the caches are depleted and do
not stabilize.  To achieve stable steady state behavior, we need to tackle
it differently.
In this patch we keep the deferred count per-node, in the new array
nr_deferred[] (the name is also a bit more descriptive) and will never
accumulate that to other nodes.
Signed-off-by: Glauber Costa <glommer@openvz.org>
Cc: Dave Chinner <dchinner@redhat.com>
Cc: Mel Gorman <mgorman@suse.de>
Cc: "Theodore Ts'o" <tytso@mit.edu>
Cc: Adrian Hunter <adrian.hunter@intel.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Artem Bityutskiy <artem.bityutskiy@linux.intel.com>
Cc: Arve Hjønnevåg <arve@android.com>
Cc: Carlos Maiolino <cmaiolino@redhat.com>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Chuck Lever <chuck.lever@oracle.com>
Cc: Daniel Vetter <daniel.vetter@ffwll.ch>
Cc: David Rientjes <rientjes@google.com>
Cc: Gleb Natapov <gleb@redhat.com>
Cc: Greg Thelen <gthelen@google.com>
Cc: J. Bruce Fields <bfields@redhat.com>
Cc: Jan Kara <jack@suse.cz>
Cc: Jerome Glisse <jglisse@redhat.com>
Cc: John Stultz <john.stultz@linaro.org>
Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Kent Overstreet <koverstreet@google.com>
Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Cc: Marcelo Tosatti <mtosatti@redhat.com>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Steven Whitehouse <swhiteho@redhat.com>
Cc: Thomas Hellstrom <thellstrom@vmware.com>
Cc: Trond Myklebust <Trond.Myklebust@netapp.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
											
										 
											2013-08-28 10:18:04 +10:00
										 |  |  | 
 | 
					
						
							|  |  |  | 		count_vm_events(SLABS_SCANNED, batch_size); | 
					
						
							|  |  |  | 		total_scan -= batch_size; | 
					
						
							|  |  |  | 
 | 
					
						
							|  |  |  | 		cond_resched(); | 
					
						
							|  |  |  | 	} | 
					
						
							|  |  |  | 
 | 
					
						
							|  |  |  | 	/*
 | 
					
						
							|  |  |  | 	 * move the unused scan count back into the shrinker in a | 
					
						
							|  |  |  | 	 * manner that handles concurrent updates. If we exhausted the | 
					
						
							|  |  |  | 	 * scan, there is no need to do an update. | 
					
						
							|  |  |  | 	 */ | 
					
						
							|  |  |  | 	if (total_scan > 0) | 
					
						
							|  |  |  | 		new_nr = atomic_long_add_return(total_scan, | 
					
						
							|  |  |  | 						&shrinker->nr_deferred[nid]); | 
					
						
							|  |  |  | 	else | 
					
						
							|  |  |  | 		new_nr = atomic_long_read(&shrinker->nr_deferred[nid]); | 
					
						
							|  |  |  | 
 | 
					
						
							|  |  |  | 	trace_mm_shrink_slab_end(shrinker, freed, nr, new_nr); | 
					
						
							|  |  |  | 	return freed; | 
					
						
							| 
									
										
										
										
											2011-05-24 17:12:27 -07:00
										 |  |  | } | 
					
						
							|  |  |  | 
 | 
					
						
							| 
									
										
										
										
											2005-04-16 15:20:36 -07:00
										 |  |  | /*
 | 
					
						
							|  |  |  |  * Call the shrink functions to age shrinkable caches | 
					
						
							|  |  |  |  * | 
					
						
							|  |  |  |  * Here we assume it costs one seek to replace a lru page and that it also | 
					
						
							|  |  |  |  * takes a seek to recreate a cache object.  With this in mind we age equal | 
					
						
							|  |  |  |  * percentages of the lru and ageable caches.  This should balance the seeks | 
					
						
							|  |  |  |  * generated by these structures. | 
					
						
							|  |  |  |  * | 
					
						
							| 
									
										
										
										
											2007-10-20 01:27:18 +02:00
										 |  |  |  * If the vm encountered mapped pages on the LRU it increase the pressure on | 
					
						
							| 
									
										
										
										
											2005-04-16 15:20:36 -07:00
										 |  |  |  * slab to avoid swapping. | 
					
						
							|  |  |  |  * | 
					
						
							|  |  |  |  * We do weird things to avoid (scanned*seeks*entries) overflowing 32 bits. | 
					
						
							|  |  |  |  * | 
					
						
							|  |  |  |  * `lru_pages' represents the number of on-LRU pages in all the zones which | 
					
						
							|  |  |  |  * are eligible for the caller's allocation attempt.  It is used for balancing | 
					
						
							|  |  |  |  * slab reclaim versus page reclaim. | 
					
						
							| 
									
										
										
										
											2005-06-21 17:14:35 -07:00
										 |  |  |  * | 
					
						
							|  |  |  |  * Returns the number of slab objects which we shrunk. | 
					
						
							| 
									
										
										
										
											2005-04-16 15:20:36 -07:00
										 |  |  |  */ | 
					
						
							| 
									
										
										
										
											2013-08-28 10:17:56 +10:00
										 |  |  | unsigned long shrink_slab(struct shrink_control *shrinkctl, | 
					
						
							| 
									
										
										
										
											2011-05-24 17:12:27 -07:00
										 |  |  | 			  unsigned long nr_pages_scanned, | 
					
						
							| 
									
										
										
										
											2011-05-24 17:12:26 -07:00
										 |  |  | 			  unsigned long lru_pages) | 
					
						
							| 
									
										
										
										
											2005-04-16 15:20:36 -07:00
										 |  |  | { | 
					
						
							|  |  |  | 	struct shrinker *shrinker; | 
					
						
							| 
									
										
										
										
											2013-08-28 10:17:56 +10:00
										 |  |  | 	unsigned long freed = 0; | 
					
						
							| 
									
										
										
										
											2005-04-16 15:20:36 -07:00
										 |  |  | 
 | 
					
						
							| 
									
										
										
										
											2011-05-24 17:12:27 -07:00
										 |  |  | 	if (nr_pages_scanned == 0) | 
					
						
							|  |  |  | 		nr_pages_scanned = SWAP_CLUSTER_MAX; | 
					
						
							| 
									
										
										
										
											2005-04-16 15:20:36 -07:00
										 |  |  | 
 | 
					
						
							| 
									
										
										
										
											2011-05-24 17:11:11 -07:00
										 |  |  | 	if (!down_read_trylock(&shrinker_rwsem)) { | 
					
						
							| 
									
										
										
										
											2013-08-28 10:17:56 +10:00
										 |  |  | 		/*
 | 
					
						
							|  |  |  | 		 * If we would return 0, our callers would understand that we | 
					
						
							|  |  |  | 		 * have nothing else to shrink and give up trying. By returning | 
					
						
							|  |  |  | 		 * 1 we keep it going and assume we'll be able to shrink next | 
					
						
							|  |  |  | 		 * time. | 
					
						
							|  |  |  | 		 */ | 
					
						
							|  |  |  | 		freed = 1; | 
					
						
							| 
									
										
										
										
											2011-05-24 17:11:11 -07:00
										 |  |  | 		goto out; | 
					
						
							|  |  |  | 	} | 
					
						
							| 
									
										
										
										
											2005-04-16 15:20:36 -07:00
										 |  |  | 
 | 
					
						
							|  |  |  | 	list_for_each_entry(shrinker, &shrinker_list, list) { | 
					
						
							| 
									
										
											  
											
												vmscan: per-node deferred work
The list_lru infrastructure already keeps per-node LRU lists in its
node-specific list_lru_node arrays and provide us with a per-node API, and
the shrinkers are properly equiped with node information.  This means that
we can now focus our shrinking effort in a single node, but the work that
is deferred from one run to another is kept global at nr_in_batch.  Work
can be deferred, for instance, during direct reclaim under a GFP_NOFS
allocation, where situation, all the filesystem shrinkers will be
prevented from running and accumulate in nr_in_batch the amount of work
they should have done, but could not.
This creates an impedance problem, where upon node pressure, work deferred
will accumulate and end up being flushed in other nodes.  The problem we
describe is particularly harmful in big machines, where many nodes can
accumulate at the same time, all adding to the global counter nr_in_batch.
 As we accumulate more and more, we start to ask for the caches to flush
even bigger numbers.  The result is that the caches are depleted and do
not stabilize.  To achieve stable steady state behavior, we need to tackle
it differently.
In this patch we keep the deferred count per-node, in the new array
nr_deferred[] (the name is also a bit more descriptive) and will never
accumulate that to other nodes.
Signed-off-by: Glauber Costa <glommer@openvz.org>
Cc: Dave Chinner <dchinner@redhat.com>
Cc: Mel Gorman <mgorman@suse.de>
Cc: "Theodore Ts'o" <tytso@mit.edu>
Cc: Adrian Hunter <adrian.hunter@intel.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Artem Bityutskiy <artem.bityutskiy@linux.intel.com>
Cc: Arve Hjønnevåg <arve@android.com>
Cc: Carlos Maiolino <cmaiolino@redhat.com>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Chuck Lever <chuck.lever@oracle.com>
Cc: Daniel Vetter <daniel.vetter@ffwll.ch>
Cc: David Rientjes <rientjes@google.com>
Cc: Gleb Natapov <gleb@redhat.com>
Cc: Greg Thelen <gthelen@google.com>
Cc: J. Bruce Fields <bfields@redhat.com>
Cc: Jan Kara <jack@suse.cz>
Cc: Jerome Glisse <jglisse@redhat.com>
Cc: John Stultz <john.stultz@linaro.org>
Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Kent Overstreet <koverstreet@google.com>
Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Cc: Marcelo Tosatti <mtosatti@redhat.com>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Steven Whitehouse <swhiteho@redhat.com>
Cc: Thomas Hellstrom <thellstrom@vmware.com>
Cc: Trond Myklebust <Trond.Myklebust@netapp.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
											
										 
											2013-08-28 10:18:04 +10:00
										 |  |  | 		for_each_node_mask(shrinkctl->nid, shrinkctl->nodes_to_scan) { | 
					
						
							|  |  |  | 			if (!node_online(shrinkctl->nid)) | 
					
						
							|  |  |  | 				continue; | 
					
						
							| 
									
										
										
										
											2005-04-16 15:20:36 -07:00
										 |  |  | 
 | 
					
						
							| 
									
										
											  
											
												vmscan: per-node deferred work
The list_lru infrastructure already keeps per-node LRU lists in its
node-specific list_lru_node arrays and provide us with a per-node API, and
the shrinkers are properly equiped with node information.  This means that
we can now focus our shrinking effort in a single node, but the work that
is deferred from one run to another is kept global at nr_in_batch.  Work
can be deferred, for instance, during direct reclaim under a GFP_NOFS
allocation, where situation, all the filesystem shrinkers will be
prevented from running and accumulate in nr_in_batch the amount of work
they should have done, but could not.
This creates an impedance problem, where upon node pressure, work deferred
will accumulate and end up being flushed in other nodes.  The problem we
describe is particularly harmful in big machines, where many nodes can
accumulate at the same time, all adding to the global counter nr_in_batch.
 As we accumulate more and more, we start to ask for the caches to flush
even bigger numbers.  The result is that the caches are depleted and do
not stabilize.  To achieve stable steady state behavior, we need to tackle
it differently.
In this patch we keep the deferred count per-node, in the new array
nr_deferred[] (the name is also a bit more descriptive) and will never
accumulate that to other nodes.
Signed-off-by: Glauber Costa <glommer@openvz.org>
Cc: Dave Chinner <dchinner@redhat.com>
Cc: Mel Gorman <mgorman@suse.de>
Cc: "Theodore Ts'o" <tytso@mit.edu>
Cc: Adrian Hunter <adrian.hunter@intel.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Artem Bityutskiy <artem.bityutskiy@linux.intel.com>
Cc: Arve Hjønnevåg <arve@android.com>
Cc: Carlos Maiolino <cmaiolino@redhat.com>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Chuck Lever <chuck.lever@oracle.com>
Cc: Daniel Vetter <daniel.vetter@ffwll.ch>
Cc: David Rientjes <rientjes@google.com>
Cc: Gleb Natapov <gleb@redhat.com>
Cc: Greg Thelen <gthelen@google.com>
Cc: J. Bruce Fields <bfields@redhat.com>
Cc: Jan Kara <jack@suse.cz>
Cc: Jerome Glisse <jglisse@redhat.com>
Cc: John Stultz <john.stultz@linaro.org>
Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Kent Overstreet <koverstreet@google.com>
Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Cc: Marcelo Tosatti <mtosatti@redhat.com>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Steven Whitehouse <swhiteho@redhat.com>
Cc: Thomas Hellstrom <thellstrom@vmware.com>
Cc: Trond Myklebust <Trond.Myklebust@netapp.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
											
										 
											2013-08-28 10:18:04 +10:00
										 |  |  | 			if (!(shrinker->flags & SHRINKER_NUMA_AWARE) && | 
					
						
							|  |  |  | 			    (shrinkctl->nid != 0)) | 
					
						
							| 
									
										
										
										
											2005-04-16 15:20:36 -07:00
										 |  |  | 				break; | 
					
						
							|  |  |  | 
 | 
					
						
							| 
									
										
											  
											
												vmscan: per-node deferred work
The list_lru infrastructure already keeps per-node LRU lists in its
node-specific list_lru_node arrays and provide us with a per-node API, and
the shrinkers are properly equiped with node information.  This means that
we can now focus our shrinking effort in a single node, but the work that
is deferred from one run to another is kept global at nr_in_batch.  Work
can be deferred, for instance, during direct reclaim under a GFP_NOFS
allocation, where situation, all the filesystem shrinkers will be
prevented from running and accumulate in nr_in_batch the amount of work
they should have done, but could not.
This creates an impedance problem, where upon node pressure, work deferred
will accumulate and end up being flushed in other nodes.  The problem we
describe is particularly harmful in big machines, where many nodes can
accumulate at the same time, all adding to the global counter nr_in_batch.
 As we accumulate more and more, we start to ask for the caches to flush
even bigger numbers.  The result is that the caches are depleted and do
not stabilize.  To achieve stable steady state behavior, we need to tackle
it differently.
In this patch we keep the deferred count per-node, in the new array
nr_deferred[] (the name is also a bit more descriptive) and will never
accumulate that to other nodes.
Signed-off-by: Glauber Costa <glommer@openvz.org>
Cc: Dave Chinner <dchinner@redhat.com>
Cc: Mel Gorman <mgorman@suse.de>
Cc: "Theodore Ts'o" <tytso@mit.edu>
Cc: Adrian Hunter <adrian.hunter@intel.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Artem Bityutskiy <artem.bityutskiy@linux.intel.com>
Cc: Arve Hjønnevåg <arve@android.com>
Cc: Carlos Maiolino <cmaiolino@redhat.com>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Chuck Lever <chuck.lever@oracle.com>
Cc: Daniel Vetter <daniel.vetter@ffwll.ch>
Cc: David Rientjes <rientjes@google.com>
Cc: Gleb Natapov <gleb@redhat.com>
Cc: Greg Thelen <gthelen@google.com>
Cc: J. Bruce Fields <bfields@redhat.com>
Cc: Jan Kara <jack@suse.cz>
Cc: Jerome Glisse <jglisse@redhat.com>
Cc: John Stultz <john.stultz@linaro.org>
Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Kent Overstreet <koverstreet@google.com>
Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Cc: Marcelo Tosatti <mtosatti@redhat.com>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Steven Whitehouse <swhiteho@redhat.com>
Cc: Thomas Hellstrom <thellstrom@vmware.com>
Cc: Trond Myklebust <Trond.Myklebust@netapp.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
											
										 
											2013-08-28 10:18:04 +10:00
										 |  |  | 			freed += shrink_slab_node(shrinkctl, shrinker, | 
					
						
							|  |  |  | 				 nr_pages_scanned, lru_pages); | 
					
						
							| 
									
										
										
										
											2005-04-16 15:20:36 -07:00
										 |  |  | 
 | 
					
						
							|  |  |  | 		} | 
					
						
							|  |  |  | 	} | 
					
						
							|  |  |  | 	up_read(&shrinker_rwsem); | 
					
						
							| 
									
										
										
										
											2011-05-24 17:11:11 -07:00
										 |  |  | out: | 
					
						
							|  |  |  | 	cond_resched(); | 
					
						
							| 
									
										
										
										
											2013-08-28 10:17:56 +10:00
										 |  |  | 	return freed; | 
					
						
							| 
									
										
										
										
											2005-04-16 15:20:36 -07:00
										 |  |  | } | 
					
						
							|  |  |  | 
 | 
					
						
							|  |  |  | static inline int is_page_cache_freeable(struct page *page) | 
					
						
							|  |  |  | { | 
					
						
							| 
									
										
										
										
											2009-09-21 17:03:00 -07:00
										 |  |  | 	/*
 | 
					
						
							|  |  |  | 	 * A freeable page cache page is referenced only by the caller | 
					
						
							|  |  |  | 	 * that isolated the page, the page cache radix tree and | 
					
						
							|  |  |  | 	 * optional buffer heads at page->private. | 
					
						
							|  |  |  | 	 */ | 
					
						
							| 
									
										
										
										
											2009-09-21 17:02:59 -07:00
										 |  |  | 	return page_count(page) - page_has_private(page) == 2; | 
					
						
							| 
									
										
										
										
											2005-04-16 15:20:36 -07:00
										 |  |  | } | 
					
						
							|  |  |  | 
 | 
					
						
							| 
									
										
											  
											
												vmscan: narrow the scenarios in whcih lumpy reclaim uses synchrounous reclaim
shrink_page_list() can decide to give up reclaiming a page under a
number of conditions such as
  1. trylock_page() failure
  2. page is unevictable
  3. zone reclaim and page is mapped
  4. PageWriteback() is true
  5. page is swapbacked and swap is full
  6. add_to_swap() failure
  7. page is dirty and gfpmask don't have GFP_IO, GFP_FS
  8. page is pinned
  9. IO queue is congested
 10. pageout() start IO, but not finished
With lumpy reclaim, failures result in entering synchronous lumpy reclaim
but this can be unnecessary.  In cases (2), (3), (5), (6), (7) and (8),
there is no point retrying.  This patch causes lumpy reclaim to abort when
it is known it will fail.
Case (9) is more interesting. current behavior is,
  1. start shrink_page_list(async)
  2. found queue_congested()
  3. skip pageout write
  4. still start shrink_page_list(sync)
  5. wait on a lot of pages
  6. again, found queue_congested()
  7. give up pageout write again
So, it's useless time wasting.  However, just skipping page reclaim is
also notgood as x86 allocating a huge page needs 512 pages for example.
It can have more dirty pages than queue congestion threshold (~=128).
After this patch, pageout() behaves as follows;
 - If order > PAGE_ALLOC_COSTLY_ORDER
	Ignore queue congestion always.
 - If order <= PAGE_ALLOC_COSTLY_ORDER
	skip write page and disable lumpy reclaim.
Signed-off-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Signed-off-by: Mel Gorman <mel@csn.ul.ie>
Reviewed-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Minchan Kim <minchan.kim@gmail.com>
Cc: Wu Fengguang <fengguang.wu@intel.com>
Cc: Rik van Riel <riel@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
											
										 
											2010-10-26 14:21:42 -07:00
										 |  |  | static int may_write_to_queue(struct backing_dev_info *bdi, | 
					
						
							|  |  |  | 			      struct scan_control *sc) | 
					
						
							| 
									
										
										
										
											2005-04-16 15:20:36 -07:00
										 |  |  | { | 
					
						
							| 
									
										
										
										
											2006-01-08 01:00:47 -08:00
										 |  |  | 	if (current->flags & PF_SWAPWRITE) | 
					
						
							| 
									
										
										
										
											2005-04-16 15:20:36 -07:00
										 |  |  | 		return 1; | 
					
						
							|  |  |  | 	if (!bdi_write_congested(bdi)) | 
					
						
							|  |  |  | 		return 1; | 
					
						
							|  |  |  | 	if (bdi == current->backing_dev_info) | 
					
						
							|  |  |  | 		return 1; | 
					
						
							|  |  |  | 	return 0; | 
					
						
							|  |  |  | } | 
					
						
							|  |  |  | 
 | 
					
						
							|  |  |  | /*
 | 
					
						
							|  |  |  |  * We detected a synchronous write error writing a page out.  Probably | 
					
						
							|  |  |  |  * -ENOSPC.  We need to propagate that into the address_space for a subsequent | 
					
						
							|  |  |  |  * fsync(), msync() or close(). | 
					
						
							|  |  |  |  * | 
					
						
							|  |  |  |  * The tricky part is that after writepage we cannot touch the mapping: nothing | 
					
						
							|  |  |  |  * prevents it from being freed up.  But we have a ref on the page and once | 
					
						
							|  |  |  |  * that page is locked, the mapping is pinned. | 
					
						
							|  |  |  |  * | 
					
						
							|  |  |  |  * We're allowed to run sleeping lock_page() here because we know the caller has | 
					
						
							|  |  |  |  * __GFP_FS. | 
					
						
							|  |  |  |  */ | 
					
						
							|  |  |  | static void handle_write_error(struct address_space *mapping, | 
					
						
							|  |  |  | 				struct page *page, int error) | 
					
						
							|  |  |  | { | 
					
						
							| 
									
										
										
										
											2011-03-10 08:52:07 +01:00
										 |  |  | 	lock_page(page); | 
					
						
							| 
									
										
										
										
											2007-05-08 00:23:25 -07:00
										 |  |  | 	if (page_mapping(page) == mapping) | 
					
						
							|  |  |  | 		mapping_set_error(mapping, error); | 
					
						
							| 
									
										
										
										
											2005-04-16 15:20:36 -07:00
										 |  |  | 	unlock_page(page); | 
					
						
							|  |  |  | } | 
					
						
							|  |  |  | 
 | 
					
						
							| 
									
										
										
										
											2006-06-23 02:03:38 -07:00
										 |  |  | /* possible outcome of pageout() */ | 
					
						
							|  |  |  | typedef enum { | 
					
						
							|  |  |  | 	/* failed to write page out, page is locked */ | 
					
						
							|  |  |  | 	PAGE_KEEP, | 
					
						
							|  |  |  | 	/* move page to the active list, page is locked */ | 
					
						
							|  |  |  | 	PAGE_ACTIVATE, | 
					
						
							|  |  |  | 	/* page has been sent to the disk successfully, page is unlocked */ | 
					
						
							|  |  |  | 	PAGE_SUCCESS, | 
					
						
							|  |  |  | 	/* page is clean and locked */ | 
					
						
							|  |  |  | 	PAGE_CLEAN, | 
					
						
							|  |  |  | } pageout_t; | 
					
						
							|  |  |  | 
 | 
					
						
							| 
									
										
										
										
											2005-04-16 15:20:36 -07:00
										 |  |  | /*
 | 
					
						
							| 
									
										
										
											
												[PATCH] vmscan: rename functions
We have:
	try_to_free_pages
	->shrink_caches(struct zone **zones, ..)
	  ->shrink_zone(struct zone *, ...)
	    ->shrink_cache(struct zone *, ...)
	      ->shrink_list(struct list_head *, ...)
	    ->refill_inactive_list((struct zone *, ...)
which is fairly irrational.
Rename things so that we have
 	try_to_free_pages
 	->shrink_zones(struct zone **zones, ..)
 	  ->shrink_zone(struct zone *, ...)
 	    ->shrink_inactive_list(struct zone *, ...)
 	      ->shrink_page_list(struct list_head *, ...)
	    ->shrink_active_list(struct zone *, ...)
Cc: Nick Piggin <nickpiggin@yahoo.com.au>
Cc: Christoph Lameter <christoph@lameter.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
											
										 
											2006-03-22 00:08:21 -08:00
										 |  |  |  * pageout is called by shrink_page_list() for each dirty page. | 
					
						
							|  |  |  |  * Calls ->writepage(). | 
					
						
							| 
									
										
										
										
											2005-04-16 15:20:36 -07:00
										 |  |  |  */ | 
					
						
							| 
									
										
										
										
											2007-08-22 14:01:26 -07:00
										 |  |  | static pageout_t pageout(struct page *page, struct address_space *mapping, | 
					
						
							| 
									
										
											  
											
												vmscan: narrow the scenarios in whcih lumpy reclaim uses synchrounous reclaim
shrink_page_list() can decide to give up reclaiming a page under a
number of conditions such as
  1. trylock_page() failure
  2. page is unevictable
  3. zone reclaim and page is mapped
  4. PageWriteback() is true
  5. page is swapbacked and swap is full
  6. add_to_swap() failure
  7. page is dirty and gfpmask don't have GFP_IO, GFP_FS
  8. page is pinned
  9. IO queue is congested
 10. pageout() start IO, but not finished
With lumpy reclaim, failures result in entering synchronous lumpy reclaim
but this can be unnecessary.  In cases (2), (3), (5), (6), (7) and (8),
there is no point retrying.  This patch causes lumpy reclaim to abort when
it is known it will fail.
Case (9) is more interesting. current behavior is,
  1. start shrink_page_list(async)
  2. found queue_congested()
  3. skip pageout write
  4. still start shrink_page_list(sync)
  5. wait on a lot of pages
  6. again, found queue_congested()
  7. give up pageout write again
So, it's useless time wasting.  However, just skipping page reclaim is
also notgood as x86 allocating a huge page needs 512 pages for example.
It can have more dirty pages than queue congestion threshold (~=128).
After this patch, pageout() behaves as follows;
 - If order > PAGE_ALLOC_COSTLY_ORDER
	Ignore queue congestion always.
 - If order <= PAGE_ALLOC_COSTLY_ORDER
	skip write page and disable lumpy reclaim.
Signed-off-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Signed-off-by: Mel Gorman <mel@csn.ul.ie>
Reviewed-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Minchan Kim <minchan.kim@gmail.com>
Cc: Wu Fengguang <fengguang.wu@intel.com>
Cc: Rik van Riel <riel@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
											
										 
											2010-10-26 14:21:42 -07:00
										 |  |  | 			 struct scan_control *sc) | 
					
						
							| 
									
										
										
										
											2005-04-16 15:20:36 -07:00
										 |  |  | { | 
					
						
							|  |  |  | 	/*
 | 
					
						
							|  |  |  | 	 * If the page is dirty, only perform writeback if that write | 
					
						
							|  |  |  | 	 * will be non-blocking.  To prevent this allocation from being | 
					
						
							|  |  |  | 	 * stalled by pagecache activity.  But note that there may be | 
					
						
							|  |  |  | 	 * stalls if we need to run get_block().  We could test | 
					
						
							|  |  |  | 	 * PagePrivate for that. | 
					
						
							|  |  |  | 	 * | 
					
						
							| 
									
										
										
										
											2009-12-14 17:58:49 -08:00
										 |  |  | 	 * If this process is currently in __generic_file_aio_write() against | 
					
						
							| 
									
										
										
										
											2005-04-16 15:20:36 -07:00
										 |  |  | 	 * this page's queue, we can perform writeback even if that | 
					
						
							|  |  |  | 	 * will block. | 
					
						
							|  |  |  | 	 * | 
					
						
							|  |  |  | 	 * If the page is swapcache, write it back even if that would | 
					
						
							|  |  |  | 	 * block, for some throttling. This happens by accident, because | 
					
						
							|  |  |  | 	 * swap_backing_dev_info is bust: it doesn't reflect the | 
					
						
							|  |  |  | 	 * congestion state of the swapdevs.  Easy to fix, if needed. | 
					
						
							|  |  |  | 	 */ | 
					
						
							|  |  |  | 	if (!is_page_cache_freeable(page)) | 
					
						
							|  |  |  | 		return PAGE_KEEP; | 
					
						
							|  |  |  | 	if (!mapping) { | 
					
						
							|  |  |  | 		/*
 | 
					
						
							|  |  |  | 		 * Some data journaling orphaned pages can have | 
					
						
							|  |  |  | 		 * page->mapping == NULL while being dirty with clean buffers. | 
					
						
							|  |  |  | 		 */ | 
					
						
							| 
									
										
										
										
											2009-04-03 16:42:36 +01:00
										 |  |  | 		if (page_has_private(page)) { | 
					
						
							| 
									
										
										
										
											2005-04-16 15:20:36 -07:00
										 |  |  | 			if (try_to_free_buffers(page)) { | 
					
						
							|  |  |  | 				ClearPageDirty(page); | 
					
						
							| 
									
										
										
										
											2008-04-30 00:55:07 -07:00
										 |  |  | 				printk("%s: orphaned page\n", __func__); | 
					
						
							| 
									
										
										
										
											2005-04-16 15:20:36 -07:00
										 |  |  | 				return PAGE_CLEAN; | 
					
						
							|  |  |  | 			} | 
					
						
							|  |  |  | 		} | 
					
						
							|  |  |  | 		return PAGE_KEEP; | 
					
						
							|  |  |  | 	} | 
					
						
							|  |  |  | 	if (mapping->a_ops->writepage == NULL) | 
					
						
							|  |  |  | 		return PAGE_ACTIVATE; | 
					
						
							| 
									
										
										
										
											2010-10-26 14:21:45 -07:00
										 |  |  | 	if (!may_write_to_queue(mapping->backing_dev_info, sc)) | 
					
						
							| 
									
										
										
										
											2005-04-16 15:20:36 -07:00
										 |  |  | 		return PAGE_KEEP; | 
					
						
							|  |  |  | 
 | 
					
						
							|  |  |  | 	if (clear_page_dirty_for_io(page)) { | 
					
						
							|  |  |  | 		int res; | 
					
						
							|  |  |  | 		struct writeback_control wbc = { | 
					
						
							|  |  |  | 			.sync_mode = WB_SYNC_NONE, | 
					
						
							|  |  |  | 			.nr_to_write = SWAP_CLUSTER_MAX, | 
					
						
							| 
									
										
											  
											
												[PATCH] writeback: fix range handling
When a writeback_control's `start' and `end' fields are used to
indicate a one-byte-range starting at file offset zero, the required
values of .start=0,.end=0 mean that the ->writepages() implementation
has no way of telling that it is being asked to perform a range
request.  Because we're currently overloading (start == 0 && end == 0)
to mean "this is not a write-a-range request".
To make all this sane, the patch changes range of writeback_control.
So caller does: If it is calling ->writepages() to write pages, it
sets range (range_start/end or range_cyclic) always.
And if range_cyclic is true, ->writepages() thinks the range is
cyclic, otherwise it just uses range_start and range_end.
This patch does,
    - Add LLONG_MAX, LLONG_MIN, ULLONG_MAX to include/linux/kernel.h
      -1 is usually ok for range_end (type is long long). But, if someone did,
		range_end += val;		range_end is "val - 1"
		u64val = range_end >> bits;	u64val is "~(0ULL)"
      or something, they are wrong. So, this adds LLONG_MAX to avoid nasty
      things, and uses LLONG_MAX for range_end.
    - All callers of ->writepages() sets range_start/end or range_cyclic.
    - Fix updates of ->writeback_index. It seems already bit strange.
      If it starts at 0 and ended by check of nr_to_write, this last
      index may reduce chance to scan end of file.  So, this updates
      ->writeback_index only if range_cyclic is true or whole-file is
      scanned.
Signed-off-by: OGAWA Hirofumi <hirofumi@mail.parknet.co.jp>
Cc: Nathan Scott <nathans@sgi.com>
Cc: Anton Altaparmakov <aia21@cantab.net>
Cc: Steven French <sfrench@us.ibm.com>
Cc: "Vladimir V. Saveliev" <vs@namesys.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
											
										 
											2006-06-23 02:03:26 -07:00
										 |  |  | 			.range_start = 0, | 
					
						
							|  |  |  | 			.range_end = LLONG_MAX, | 
					
						
							| 
									
										
										
										
											2005-04-16 15:20:36 -07:00
										 |  |  | 			.for_reclaim = 1, | 
					
						
							|  |  |  | 		}; | 
					
						
							|  |  |  | 
 | 
					
						
							|  |  |  | 		SetPageReclaim(page); | 
					
						
							|  |  |  | 		res = mapping->a_ops->writepage(page, &wbc); | 
					
						
							|  |  |  | 		if (res < 0) | 
					
						
							|  |  |  | 			handle_write_error(mapping, page, res); | 
					
						
							| 
									
										
										
										
											2005-12-15 14:28:17 -08:00
										 |  |  | 		if (res == AOP_WRITEPAGE_ACTIVATE) { | 
					
						
							| 
									
										
										
										
											2005-04-16 15:20:36 -07:00
										 |  |  | 			ClearPageReclaim(page); | 
					
						
							|  |  |  | 			return PAGE_ACTIVATE; | 
					
						
							|  |  |  | 		} | 
					
						
							| 
									
										
										
										
											2007-08-22 14:01:26 -07:00
										 |  |  | 
 | 
					
						
							| 
									
										
										
										
											2005-04-16 15:20:36 -07:00
										 |  |  | 		if (!PageWriteback(page)) { | 
					
						
							|  |  |  | 			/* synchronous write or broken a_ops? */ | 
					
						
							|  |  |  | 			ClearPageReclaim(page); | 
					
						
							|  |  |  | 		} | 
					
						
							| 
									
										
										
										
											2012-05-29 15:06:20 -07:00
										 |  |  | 		trace_mm_vmscan_writepage(page, trace_reclaim_flags(page)); | 
					
						
							| 
									
										
										
										
											2006-09-27 01:50:00 -07:00
										 |  |  | 		inc_zone_page_state(page, NR_VMSCAN_WRITE); | 
					
						
							| 
									
										
										
										
											2005-04-16 15:20:36 -07:00
										 |  |  | 		return PAGE_SUCCESS; | 
					
						
							|  |  |  | 	} | 
					
						
							|  |  |  | 
 | 
					
						
							|  |  |  | 	return PAGE_CLEAN; | 
					
						
							|  |  |  | } | 
					
						
							|  |  |  | 
 | 
					
						
							| 
									
										
										
										
											2006-10-17 00:09:36 -07:00
										 |  |  | /*
 | 
					
						
							| 
									
										
										
										
											2008-07-25 19:45:30 -07:00
										 |  |  |  * Same as remove_mapping, but if the page is removed from the mapping, it | 
					
						
							|  |  |  |  * gets returned with a refcount of 0. | 
					
						
							| 
									
										
										
										
											2006-10-17 00:09:36 -07:00
										 |  |  |  */ | 
					
						
							| 
									
										
										
										
											2008-07-25 19:45:30 -07:00
										 |  |  | static int __remove_mapping(struct address_space *mapping, struct page *page) | 
					
						
							| 
									
										
										
										
											2006-01-08 01:00:48 -08:00
										 |  |  | { | 
					
						
							| 
									
										
										
										
											2006-09-25 23:31:23 -07:00
										 |  |  | 	BUG_ON(!PageLocked(page)); | 
					
						
							|  |  |  | 	BUG_ON(mapping != page_mapping(page)); | 
					
						
							| 
									
										
										
										
											2006-01-08 01:00:48 -08:00
										 |  |  | 
 | 
					
						
							| 
									
										
										
										
											2008-07-25 19:45:32 -07:00
										 |  |  | 	spin_lock_irq(&mapping->tree_lock); | 
					
						
							| 
									
										
										
										
											2006-01-08 01:00:48 -08:00
										 |  |  | 	/*
 | 
					
						
							| 
									
										
										
										
											2006-09-27 01:50:02 -07:00
										 |  |  | 	 * The non racy check for a busy page. | 
					
						
							|  |  |  | 	 * | 
					
						
							|  |  |  | 	 * Must be careful with the order of the tests. When someone has | 
					
						
							|  |  |  | 	 * a ref to the page, it may be possible that they dirty it then | 
					
						
							|  |  |  | 	 * drop the reference. So if PageDirty is tested before page_count | 
					
						
							|  |  |  | 	 * here, then the following race may occur: | 
					
						
							|  |  |  | 	 * | 
					
						
							|  |  |  | 	 * get_user_pages(&page); | 
					
						
							|  |  |  | 	 * [user mapping goes away] | 
					
						
							|  |  |  | 	 * write_to(page); | 
					
						
							|  |  |  | 	 *				!PageDirty(page)    [good] | 
					
						
							|  |  |  | 	 * SetPageDirty(page); | 
					
						
							|  |  |  | 	 * put_page(page); | 
					
						
							|  |  |  | 	 *				!page_count(page)   [good, discard it] | 
					
						
							|  |  |  | 	 * | 
					
						
							|  |  |  | 	 * [oops, our write_to data is lost] | 
					
						
							|  |  |  | 	 * | 
					
						
							|  |  |  | 	 * Reversing the order of the tests ensures such a situation cannot | 
					
						
							|  |  |  | 	 * escape unnoticed. The smp_rmb is needed to ensure the page->flags | 
					
						
							|  |  |  | 	 * load is not satisfied before that of page->_count. | 
					
						
							|  |  |  | 	 * | 
					
						
							|  |  |  | 	 * Note that if SetPageDirty is always performed via set_page_dirty, | 
					
						
							|  |  |  | 	 * and thus under tree_lock, then this ordering is not required. | 
					
						
							| 
									
										
										
										
											2006-01-08 01:00:48 -08:00
										 |  |  | 	 */ | 
					
						
							| 
									
										
										
										
											2008-07-25 19:45:30 -07:00
										 |  |  | 	if (!page_freeze_refs(page, 2)) | 
					
						
							| 
									
										
										
										
											2006-01-08 01:00:48 -08:00
										 |  |  | 		goto cannot_free; | 
					
						
							| 
									
										
										
										
											2008-07-25 19:45:30 -07:00
										 |  |  | 	/* note: atomic_cmpxchg in page_freeze_refs provides the smp_rmb */ | 
					
						
							|  |  |  | 	if (unlikely(PageDirty(page))) { | 
					
						
							|  |  |  | 		page_unfreeze_refs(page, 2); | 
					
						
							| 
									
										
										
										
											2006-01-08 01:00:48 -08:00
										 |  |  | 		goto cannot_free; | 
					
						
							| 
									
										
										
										
											2008-07-25 19:45:30 -07:00
										 |  |  | 	} | 
					
						
							| 
									
										
										
										
											2006-01-08 01:00:48 -08:00
										 |  |  | 
 | 
					
						
							|  |  |  | 	if (PageSwapCache(page)) { | 
					
						
							|  |  |  | 		swp_entry_t swap = { .val = page_private(page) }; | 
					
						
							|  |  |  | 		__delete_from_swap_cache(page); | 
					
						
							| 
									
										
										
										
											2008-07-25 19:45:32 -07:00
										 |  |  | 		spin_unlock_irq(&mapping->tree_lock); | 
					
						
							| 
									
										
										
										
											2009-06-16 15:32:52 -07:00
										 |  |  | 		swapcache_free(swap, page); | 
					
						
							| 
									
										
										
										
											2008-07-25 19:45:30 -07:00
										 |  |  | 	} else { | 
					
						
							| 
									
										
										
										
											2010-12-01 13:35:19 -05:00
										 |  |  | 		void (*freepage)(struct page *); | 
					
						
							|  |  |  | 
 | 
					
						
							|  |  |  | 		freepage = mapping->a_ops->freepage; | 
					
						
							|  |  |  | 
 | 
					
						
							| 
									
										
										
										
											2011-03-22 16:32:44 -07:00
										 |  |  | 		__delete_from_page_cache(page); | 
					
						
							| 
									
										
										
										
											2008-07-25 19:45:32 -07:00
										 |  |  | 		spin_unlock_irq(&mapping->tree_lock); | 
					
						
							| 
									
										
										
										
											2009-05-28 14:34:28 -07:00
										 |  |  | 		mem_cgroup_uncharge_cache_page(page); | 
					
						
							| 
									
										
										
										
											2010-12-01 13:35:19 -05:00
										 |  |  | 
 | 
					
						
							|  |  |  | 		if (freepage != NULL) | 
					
						
							|  |  |  | 			freepage(page); | 
					
						
							| 
									
										
										
										
											2006-01-08 01:00:48 -08:00
										 |  |  | 	} | 
					
						
							|  |  |  | 
 | 
					
						
							|  |  |  | 	return 1; | 
					
						
							|  |  |  | 
 | 
					
						
							|  |  |  | cannot_free: | 
					
						
							| 
									
										
										
										
											2008-07-25 19:45:32 -07:00
										 |  |  | 	spin_unlock_irq(&mapping->tree_lock); | 
					
						
							| 
									
										
										
										
											2006-01-08 01:00:48 -08:00
										 |  |  | 	return 0; | 
					
						
							|  |  |  | } | 
					
						
							|  |  |  | 
 | 
					
						
							| 
									
										
										
										
											2008-07-25 19:45:30 -07:00
										 |  |  | /*
 | 
					
						
							|  |  |  |  * Attempt to detach a locked page from its ->mapping.  If it is dirty or if | 
					
						
							|  |  |  |  * someone else has a ref on the page, abort and return 0.  If it was | 
					
						
							|  |  |  |  * successfully detached, return 1.  Assumes the caller has a single ref on | 
					
						
							|  |  |  |  * this page. | 
					
						
							|  |  |  |  */ | 
					
						
							|  |  |  | int remove_mapping(struct address_space *mapping, struct page *page) | 
					
						
							|  |  |  | { | 
					
						
							|  |  |  | 	if (__remove_mapping(mapping, page)) { | 
					
						
							|  |  |  | 		/*
 | 
					
						
							|  |  |  | 		 * Unfreezing the refcount with 1 rather than 2 effectively | 
					
						
							|  |  |  | 		 * drops the pagecache ref for us without requiring another | 
					
						
							|  |  |  | 		 * atomic operation. | 
					
						
							|  |  |  | 		 */ | 
					
						
							|  |  |  | 		page_unfreeze_refs(page, 1); | 
					
						
							|  |  |  | 		return 1; | 
					
						
							|  |  |  | 	} | 
					
						
							|  |  |  | 	return 0; | 
					
						
							|  |  |  | } | 
					
						
							|  |  |  | 
 | 
					
						
							| 
									
										
											  
											
												Unevictable LRU Infrastructure
When the system contains lots of mlocked or otherwise unevictable pages,
the pageout code (kswapd) can spend lots of time scanning over these
pages.  Worse still, the presence of lots of unevictable pages can confuse
kswapd into thinking that more aggressive pageout modes are required,
resulting in all kinds of bad behaviour.
Infrastructure to manage pages excluded from reclaim--i.e., hidden from
vmscan.  Based on a patch by Larry Woodman of Red Hat.  Reworked to
maintain "unevictable" pages on a separate per-zone LRU list, to "hide"
them from vmscan.
Kosaki Motohiro added the support for the memory controller unevictable
lru list.
Pages on the unevictable list have both PG_unevictable and PG_lru set.
Thus, PG_unevictable is analogous to and mutually exclusive with
PG_active--it specifies which LRU list the page is on.
The unevictable infrastructure is enabled by a new mm Kconfig option
[CONFIG_]UNEVICTABLE_LRU.
A new function 'page_evictable(page, vma)' in vmscan.c tests whether or
not a page may be evictable.  Subsequent patches will add the various
!evictable tests.  We'll want to keep these tests light-weight for use in
shrink_active_list() and, possibly, the fault path.
To avoid races between tasks putting pages [back] onto an LRU list and
tasks that might be moving the page from non-evictable to evictable state,
the new function 'putback_lru_page()' -- inverse to 'isolate_lru_page()'
-- tests the "evictability" of a page after placing it on the LRU, before
dropping the reference.  If the page has become unevictable,
putback_lru_page() will redo the 'putback', thus moving the page to the
unevictable list.  This way, we avoid "stranding" evictable pages on the
unevictable list.
[akpm@linux-foundation.org: fix fallout from out-of-order merge]
[riel@redhat.com: fix UNEVICTABLE_LRU and !PROC_PAGE_MONITOR build]
[nishimura@mxp.nes.nec.co.jp: remove redundant mapping check]
[kosaki.motohiro@jp.fujitsu.com: unevictable-lru-infrastructure: putback_lru_page()/unevictable page handling rework]
[kosaki.motohiro@jp.fujitsu.com: kill unnecessary lock_page() in vmscan.c]
[kosaki.motohiro@jp.fujitsu.com: revert migration change of unevictable lru infrastructure]
[kosaki.motohiro@jp.fujitsu.com: revert to unevictable-lru-infrastructure-kconfig-fix.patch]
[kosaki.motohiro@jp.fujitsu.com: restore patch failure of vmstat-unevictable-and-mlocked-pages-vm-events.patch]
Signed-off-by: Lee Schermerhorn <lee.schermerhorn@hp.com>
Signed-off-by: Rik van Riel <riel@redhat.com>
Signed-off-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Debugged-by: Benjamin Kidwell <benjkidwell@yahoo.com>
Signed-off-by: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp>
Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
											
										 
											2008-10-18 20:26:39 -07:00
										 |  |  | /**
 | 
					
						
							|  |  |  |  * putback_lru_page - put previously isolated page onto appropriate LRU list | 
					
						
							|  |  |  |  * @page: page to be put back to appropriate lru list | 
					
						
							|  |  |  |  * | 
					
						
							|  |  |  |  * Add previously isolated @page to appropriate LRU list. | 
					
						
							|  |  |  |  * Page may still be unevictable for other reasons. | 
					
						
							|  |  |  |  * | 
					
						
							|  |  |  |  * lru_lock must not be held, interrupts must be enabled. | 
					
						
							|  |  |  |  */ | 
					
						
							|  |  |  | void putback_lru_page(struct page *page) | 
					
						
							|  |  |  | { | 
					
						
							| 
									
										
										
										
											2013-09-11 14:22:26 -07:00
										 |  |  | 	bool is_unevictable; | 
					
						
							| 
									
										
										
										
											2008-10-18 20:26:40 -07:00
										 |  |  | 	int was_unevictable = PageUnevictable(page); | 
					
						
							| 
									
										
											  
											
												Unevictable LRU Infrastructure
When the system contains lots of mlocked or otherwise unevictable pages,
the pageout code (kswapd) can spend lots of time scanning over these
pages.  Worse still, the presence of lots of unevictable pages can confuse
kswapd into thinking that more aggressive pageout modes are required,
resulting in all kinds of bad behaviour.
Infrastructure to manage pages excluded from reclaim--i.e., hidden from
vmscan.  Based on a patch by Larry Woodman of Red Hat.  Reworked to
maintain "unevictable" pages on a separate per-zone LRU list, to "hide"
them from vmscan.
Kosaki Motohiro added the support for the memory controller unevictable
lru list.
Pages on the unevictable list have both PG_unevictable and PG_lru set.
Thus, PG_unevictable is analogous to and mutually exclusive with
PG_active--it specifies which LRU list the page is on.
The unevictable infrastructure is enabled by a new mm Kconfig option
[CONFIG_]UNEVICTABLE_LRU.
A new function 'page_evictable(page, vma)' in vmscan.c tests whether or
not a page may be evictable.  Subsequent patches will add the various
!evictable tests.  We'll want to keep these tests light-weight for use in
shrink_active_list() and, possibly, the fault path.
To avoid races between tasks putting pages [back] onto an LRU list and
tasks that might be moving the page from non-evictable to evictable state,
the new function 'putback_lru_page()' -- inverse to 'isolate_lru_page()'
-- tests the "evictability" of a page after placing it on the LRU, before
dropping the reference.  If the page has become unevictable,
putback_lru_page() will redo the 'putback', thus moving the page to the
unevictable list.  This way, we avoid "stranding" evictable pages on the
unevictable list.
[akpm@linux-foundation.org: fix fallout from out-of-order merge]
[riel@redhat.com: fix UNEVICTABLE_LRU and !PROC_PAGE_MONITOR build]
[nishimura@mxp.nes.nec.co.jp: remove redundant mapping check]
[kosaki.motohiro@jp.fujitsu.com: unevictable-lru-infrastructure: putback_lru_page()/unevictable page handling rework]
[kosaki.motohiro@jp.fujitsu.com: kill unnecessary lock_page() in vmscan.c]
[kosaki.motohiro@jp.fujitsu.com: revert migration change of unevictable lru infrastructure]
[kosaki.motohiro@jp.fujitsu.com: revert to unevictable-lru-infrastructure-kconfig-fix.patch]
[kosaki.motohiro@jp.fujitsu.com: restore patch failure of vmstat-unevictable-and-mlocked-pages-vm-events.patch]
Signed-off-by: Lee Schermerhorn <lee.schermerhorn@hp.com>
Signed-off-by: Rik van Riel <riel@redhat.com>
Signed-off-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Debugged-by: Benjamin Kidwell <benjkidwell@yahoo.com>
Signed-off-by: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp>
Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
											
										 
											2008-10-18 20:26:39 -07:00
										 |  |  | 
 | 
					
						
							|  |  |  | 	VM_BUG_ON(PageLRU(page)); | 
					
						
							|  |  |  | 
 | 
					
						
							|  |  |  | redo: | 
					
						
							|  |  |  | 	ClearPageUnevictable(page); | 
					
						
							|  |  |  | 
 | 
					
						
							| 
									
										
										
										
											2012-10-08 16:33:18 -07:00
										 |  |  | 	if (page_evictable(page)) { | 
					
						
							| 
									
										
											  
											
												Unevictable LRU Infrastructure
When the system contains lots of mlocked or otherwise unevictable pages,
the pageout code (kswapd) can spend lots of time scanning over these
pages.  Worse still, the presence of lots of unevictable pages can confuse
kswapd into thinking that more aggressive pageout modes are required,
resulting in all kinds of bad behaviour.
Infrastructure to manage pages excluded from reclaim--i.e., hidden from
vmscan.  Based on a patch by Larry Woodman of Red Hat.  Reworked to
maintain "unevictable" pages on a separate per-zone LRU list, to "hide"
them from vmscan.
Kosaki Motohiro added the support for the memory controller unevictable
lru list.
Pages on the unevictable list have both PG_unevictable and PG_lru set.
Thus, PG_unevictable is analogous to and mutually exclusive with
PG_active--it specifies which LRU list the page is on.
The unevictable infrastructure is enabled by a new mm Kconfig option
[CONFIG_]UNEVICTABLE_LRU.
A new function 'page_evictable(page, vma)' in vmscan.c tests whether or
not a page may be evictable.  Subsequent patches will add the various
!evictable tests.  We'll want to keep these tests light-weight for use in
shrink_active_list() and, possibly, the fault path.
To avoid races between tasks putting pages [back] onto an LRU list and
tasks that might be moving the page from non-evictable to evictable state,
the new function 'putback_lru_page()' -- inverse to 'isolate_lru_page()'
-- tests the "evictability" of a page after placing it on the LRU, before
dropping the reference.  If the page has become unevictable,
putback_lru_page() will redo the 'putback', thus moving the page to the
unevictable list.  This way, we avoid "stranding" evictable pages on the
unevictable list.
[akpm@linux-foundation.org: fix fallout from out-of-order merge]
[riel@redhat.com: fix UNEVICTABLE_LRU and !PROC_PAGE_MONITOR build]
[nishimura@mxp.nes.nec.co.jp: remove redundant mapping check]
[kosaki.motohiro@jp.fujitsu.com: unevictable-lru-infrastructure: putback_lru_page()/unevictable page handling rework]
[kosaki.motohiro@jp.fujitsu.com: kill unnecessary lock_page() in vmscan.c]
[kosaki.motohiro@jp.fujitsu.com: revert migration change of unevictable lru infrastructure]
[kosaki.motohiro@jp.fujitsu.com: revert to unevictable-lru-infrastructure-kconfig-fix.patch]
[kosaki.motohiro@jp.fujitsu.com: restore patch failure of vmstat-unevictable-and-mlocked-pages-vm-events.patch]
Signed-off-by: Lee Schermerhorn <lee.schermerhorn@hp.com>
Signed-off-by: Rik van Riel <riel@redhat.com>
Signed-off-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Debugged-by: Benjamin Kidwell <benjkidwell@yahoo.com>
Signed-off-by: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp>
Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
											
										 
											2008-10-18 20:26:39 -07:00
										 |  |  | 		/*
 | 
					
						
							|  |  |  | 		 * For evictable pages, we can use the cache. | 
					
						
							|  |  |  | 		 * In event of a race, worst case is we end up with an | 
					
						
							|  |  |  | 		 * unevictable page on [in]active list. | 
					
						
							|  |  |  | 		 * We know how to handle that. | 
					
						
							|  |  |  | 		 */ | 
					
						
							| 
									
										
										
										
											2013-09-11 14:22:26 -07:00
										 |  |  | 		is_unevictable = false; | 
					
						
							| 
									
										
										
										
											2013-07-03 15:02:34 -07:00
										 |  |  | 		lru_cache_add(page); | 
					
						
							| 
									
										
											  
											
												Unevictable LRU Infrastructure
When the system contains lots of mlocked or otherwise unevictable pages,
the pageout code (kswapd) can spend lots of time scanning over these
pages.  Worse still, the presence of lots of unevictable pages can confuse
kswapd into thinking that more aggressive pageout modes are required,
resulting in all kinds of bad behaviour.
Infrastructure to manage pages excluded from reclaim--i.e., hidden from
vmscan.  Based on a patch by Larry Woodman of Red Hat.  Reworked to
maintain "unevictable" pages on a separate per-zone LRU list, to "hide"
them from vmscan.
Kosaki Motohiro added the support for the memory controller unevictable
lru list.
Pages on the unevictable list have both PG_unevictable and PG_lru set.
Thus, PG_unevictable is analogous to and mutually exclusive with
PG_active--it specifies which LRU list the page is on.
The unevictable infrastructure is enabled by a new mm Kconfig option
[CONFIG_]UNEVICTABLE_LRU.
A new function 'page_evictable(page, vma)' in vmscan.c tests whether or
not a page may be evictable.  Subsequent patches will add the various
!evictable tests.  We'll want to keep these tests light-weight for use in
shrink_active_list() and, possibly, the fault path.
To avoid races between tasks putting pages [back] onto an LRU list and
tasks that might be moving the page from non-evictable to evictable state,
the new function 'putback_lru_page()' -- inverse to 'isolate_lru_page()'
-- tests the "evictability" of a page after placing it on the LRU, before
dropping the reference.  If the page has become unevictable,
putback_lru_page() will redo the 'putback', thus moving the page to the
unevictable list.  This way, we avoid "stranding" evictable pages on the
unevictable list.
[akpm@linux-foundation.org: fix fallout from out-of-order merge]
[riel@redhat.com: fix UNEVICTABLE_LRU and !PROC_PAGE_MONITOR build]
[nishimura@mxp.nes.nec.co.jp: remove redundant mapping check]
[kosaki.motohiro@jp.fujitsu.com: unevictable-lru-infrastructure: putback_lru_page()/unevictable page handling rework]
[kosaki.motohiro@jp.fujitsu.com: kill unnecessary lock_page() in vmscan.c]
[kosaki.motohiro@jp.fujitsu.com: revert migration change of unevictable lru infrastructure]
[kosaki.motohiro@jp.fujitsu.com: revert to unevictable-lru-infrastructure-kconfig-fix.patch]
[kosaki.motohiro@jp.fujitsu.com: restore patch failure of vmstat-unevictable-and-mlocked-pages-vm-events.patch]
Signed-off-by: Lee Schermerhorn <lee.schermerhorn@hp.com>
Signed-off-by: Rik van Riel <riel@redhat.com>
Signed-off-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Debugged-by: Benjamin Kidwell <benjkidwell@yahoo.com>
Signed-off-by: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp>
Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
											
										 
											2008-10-18 20:26:39 -07:00
										 |  |  | 	} else { | 
					
						
							|  |  |  | 		/*
 | 
					
						
							|  |  |  | 		 * Put unevictable pages directly on zone's unevictable | 
					
						
							|  |  |  | 		 * list. | 
					
						
							|  |  |  | 		 */ | 
					
						
							| 
									
										
										
										
											2013-09-11 14:22:26 -07:00
										 |  |  | 		is_unevictable = true; | 
					
						
							| 
									
										
											  
											
												Unevictable LRU Infrastructure
When the system contains lots of mlocked or otherwise unevictable pages,
the pageout code (kswapd) can spend lots of time scanning over these
pages.  Worse still, the presence of lots of unevictable pages can confuse
kswapd into thinking that more aggressive pageout modes are required,
resulting in all kinds of bad behaviour.
Infrastructure to manage pages excluded from reclaim--i.e., hidden from
vmscan.  Based on a patch by Larry Woodman of Red Hat.  Reworked to
maintain "unevictable" pages on a separate per-zone LRU list, to "hide"
them from vmscan.
Kosaki Motohiro added the support for the memory controller unevictable
lru list.
Pages on the unevictable list have both PG_unevictable and PG_lru set.
Thus, PG_unevictable is analogous to and mutually exclusive with
PG_active--it specifies which LRU list the page is on.
The unevictable infrastructure is enabled by a new mm Kconfig option
[CONFIG_]UNEVICTABLE_LRU.
A new function 'page_evictable(page, vma)' in vmscan.c tests whether or
not a page may be evictable.  Subsequent patches will add the various
!evictable tests.  We'll want to keep these tests light-weight for use in
shrink_active_list() and, possibly, the fault path.
To avoid races between tasks putting pages [back] onto an LRU list and
tasks that might be moving the page from non-evictable to evictable state,
the new function 'putback_lru_page()' -- inverse to 'isolate_lru_page()'
-- tests the "evictability" of a page after placing it on the LRU, before
dropping the reference.  If the page has become unevictable,
putback_lru_page() will redo the 'putback', thus moving the page to the
unevictable list.  This way, we avoid "stranding" evictable pages on the
unevictable list.
[akpm@linux-foundation.org: fix fallout from out-of-order merge]
[riel@redhat.com: fix UNEVICTABLE_LRU and !PROC_PAGE_MONITOR build]
[nishimura@mxp.nes.nec.co.jp: remove redundant mapping check]
[kosaki.motohiro@jp.fujitsu.com: unevictable-lru-infrastructure: putback_lru_page()/unevictable page handling rework]
[kosaki.motohiro@jp.fujitsu.com: kill unnecessary lock_page() in vmscan.c]
[kosaki.motohiro@jp.fujitsu.com: revert migration change of unevictable lru infrastructure]
[kosaki.motohiro@jp.fujitsu.com: revert to unevictable-lru-infrastructure-kconfig-fix.patch]
[kosaki.motohiro@jp.fujitsu.com: restore patch failure of vmstat-unevictable-and-mlocked-pages-vm-events.patch]
Signed-off-by: Lee Schermerhorn <lee.schermerhorn@hp.com>
Signed-off-by: Rik van Riel <riel@redhat.com>
Signed-off-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Debugged-by: Benjamin Kidwell <benjkidwell@yahoo.com>
Signed-off-by: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp>
Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
											
										 
											2008-10-18 20:26:39 -07:00
										 |  |  | 		add_page_to_unevictable_list(page); | 
					
						
							| 
									
										
										
										
											2009-10-26 16:50:00 -07:00
										 |  |  | 		/*
 | 
					
						
							| 
									
										
										
										
											2011-10-31 17:09:28 -07:00
										 |  |  | 		 * When racing with an mlock or AS_UNEVICTABLE clearing | 
					
						
							|  |  |  | 		 * (page is unlocked) make sure that if the other thread | 
					
						
							|  |  |  | 		 * does not observe our setting of PG_lru and fails | 
					
						
							| 
									
										
											  
											
												SHM_UNLOCK: fix Unevictable pages stranded after swap
Commit cc39c6a9bbde ("mm: account skipped entries to avoid looping in
find_get_pages") correctly fixed an infinite loop; but left a problem
that find_get_pages() on shmem would return 0 (appearing to callers to
mean end of tree) when it meets a run of nr_pages swap entries.
The only uses of find_get_pages() on shmem are via pagevec_lookup(),
called from invalidate_mapping_pages(), and from shmctl SHM_UNLOCK's
scan_mapping_unevictable_pages().  The first is already commented, and
not worth worrying about; but the second can leave pages on the
Unevictable list after an unusual sequence of swapping and locking.
Fix that by using shmem_find_get_pages_and_swap() (then ignoring the
swap) instead of pagevec_lookup().
But I don't want to contaminate vmscan.c with shmem internals, nor
shmem.c with LRU locking.  So move scan_mapping_unevictable_pages() into
shmem.c, renaming it shmem_unlock_mapping(); and rename
check_move_unevictable_page() to check_move_unevictable_pages(), looping
down an array of pages, oftentimes under the same lock.
Leave out the "rotate unevictable list" block: that's a leftover from
when this was used for /proc/sys/vm/scan_unevictable_pages, whose flawed
handling involved looking at pages at tail of LRU.
Was there significance to the sequence first ClearPageUnevictable, then
test page_evictable, then SetPageUnevictable here? I think not, we're
under LRU lock, and have no barriers between those.
Signed-off-by: Hugh Dickins <hughd@google.com>
Reviewed-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Cc: Minchan Kim <minchan.kim@gmail.com>
Cc: Rik van Riel <riel@redhat.com>
Cc: Shaohua Li <shaohua.li@intel.com>
Cc: Eric Dumazet <eric.dumazet@gmail.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Michel Lespinasse <walken@google.com>
Cc: <stable@vger.kernel.org> [back to 3.1 but will need respins]
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
											
										 
											2012-01-20 14:34:21 -08:00
										 |  |  | 		 * isolation/check_move_unevictable_pages, | 
					
						
							| 
									
										
										
										
											2011-10-31 17:09:28 -07:00
										 |  |  | 		 * we see PG_mlocked/AS_UNEVICTABLE cleared below and move | 
					
						
							| 
									
										
										
										
											2009-10-26 16:50:00 -07:00
										 |  |  | 		 * the page back to the evictable list. | 
					
						
							|  |  |  | 		 * | 
					
						
							| 
									
										
										
										
											2011-10-31 17:09:28 -07:00
										 |  |  | 		 * The other side is TestClearPageMlocked() or shmem_lock(). | 
					
						
							| 
									
										
										
										
											2009-10-26 16:50:00 -07:00
										 |  |  | 		 */ | 
					
						
							|  |  |  | 		smp_mb(); | 
					
						
							| 
									
										
											  
											
												Unevictable LRU Infrastructure
When the system contains lots of mlocked or otherwise unevictable pages,
the pageout code (kswapd) can spend lots of time scanning over these
pages.  Worse still, the presence of lots of unevictable pages can confuse
kswapd into thinking that more aggressive pageout modes are required,
resulting in all kinds of bad behaviour.
Infrastructure to manage pages excluded from reclaim--i.e., hidden from
vmscan.  Based on a patch by Larry Woodman of Red Hat.  Reworked to
maintain "unevictable" pages on a separate per-zone LRU list, to "hide"
them from vmscan.
Kosaki Motohiro added the support for the memory controller unevictable
lru list.
Pages on the unevictable list have both PG_unevictable and PG_lru set.
Thus, PG_unevictable is analogous to and mutually exclusive with
PG_active--it specifies which LRU list the page is on.
The unevictable infrastructure is enabled by a new mm Kconfig option
[CONFIG_]UNEVICTABLE_LRU.
A new function 'page_evictable(page, vma)' in vmscan.c tests whether or
not a page may be evictable.  Subsequent patches will add the various
!evictable tests.  We'll want to keep these tests light-weight for use in
shrink_active_list() and, possibly, the fault path.
To avoid races between tasks putting pages [back] onto an LRU list and
tasks that might be moving the page from non-evictable to evictable state,
the new function 'putback_lru_page()' -- inverse to 'isolate_lru_page()'
-- tests the "evictability" of a page after placing it on the LRU, before
dropping the reference.  If the page has become unevictable,
putback_lru_page() will redo the 'putback', thus moving the page to the
unevictable list.  This way, we avoid "stranding" evictable pages on the
unevictable list.
[akpm@linux-foundation.org: fix fallout from out-of-order merge]
[riel@redhat.com: fix UNEVICTABLE_LRU and !PROC_PAGE_MONITOR build]
[nishimura@mxp.nes.nec.co.jp: remove redundant mapping check]
[kosaki.motohiro@jp.fujitsu.com: unevictable-lru-infrastructure: putback_lru_page()/unevictable page handling rework]
[kosaki.motohiro@jp.fujitsu.com: kill unnecessary lock_page() in vmscan.c]
[kosaki.motohiro@jp.fujitsu.com: revert migration change of unevictable lru infrastructure]
[kosaki.motohiro@jp.fujitsu.com: revert to unevictable-lru-infrastructure-kconfig-fix.patch]
[kosaki.motohiro@jp.fujitsu.com: restore patch failure of vmstat-unevictable-and-mlocked-pages-vm-events.patch]
Signed-off-by: Lee Schermerhorn <lee.schermerhorn@hp.com>
Signed-off-by: Rik van Riel <riel@redhat.com>
Signed-off-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Debugged-by: Benjamin Kidwell <benjkidwell@yahoo.com>
Signed-off-by: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp>
Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
											
										 
											2008-10-18 20:26:39 -07:00
										 |  |  | 	} | 
					
						
							|  |  |  | 
 | 
					
						
							|  |  |  | 	/*
 | 
					
						
							|  |  |  | 	 * page's status can change while we move it among lru. If an evictable | 
					
						
							|  |  |  | 	 * page is on unevictable list, it never be freed. To avoid that, | 
					
						
							|  |  |  | 	 * check after we added it to the list, again. | 
					
						
							|  |  |  | 	 */ | 
					
						
							| 
									
										
										
										
											2013-09-11 14:22:26 -07:00
										 |  |  | 	if (is_unevictable && page_evictable(page)) { | 
					
						
							| 
									
										
											  
											
												Unevictable LRU Infrastructure
When the system contains lots of mlocked or otherwise unevictable pages,
the pageout code (kswapd) can spend lots of time scanning over these
pages.  Worse still, the presence of lots of unevictable pages can confuse
kswapd into thinking that more aggressive pageout modes are required,
resulting in all kinds of bad behaviour.
Infrastructure to manage pages excluded from reclaim--i.e., hidden from
vmscan.  Based on a patch by Larry Woodman of Red Hat.  Reworked to
maintain "unevictable" pages on a separate per-zone LRU list, to "hide"
them from vmscan.
Kosaki Motohiro added the support for the memory controller unevictable
lru list.
Pages on the unevictable list have both PG_unevictable and PG_lru set.
Thus, PG_unevictable is analogous to and mutually exclusive with
PG_active--it specifies which LRU list the page is on.
The unevictable infrastructure is enabled by a new mm Kconfig option
[CONFIG_]UNEVICTABLE_LRU.
A new function 'page_evictable(page, vma)' in vmscan.c tests whether or
not a page may be evictable.  Subsequent patches will add the various
!evictable tests.  We'll want to keep these tests light-weight for use in
shrink_active_list() and, possibly, the fault path.
To avoid races between tasks putting pages [back] onto an LRU list and
tasks that might be moving the page from non-evictable to evictable state,
the new function 'putback_lru_page()' -- inverse to 'isolate_lru_page()'
-- tests the "evictability" of a page after placing it on the LRU, before
dropping the reference.  If the page has become unevictable,
putback_lru_page() will redo the 'putback', thus moving the page to the
unevictable list.  This way, we avoid "stranding" evictable pages on the
unevictable list.
[akpm@linux-foundation.org: fix fallout from out-of-order merge]
[riel@redhat.com: fix UNEVICTABLE_LRU and !PROC_PAGE_MONITOR build]
[nishimura@mxp.nes.nec.co.jp: remove redundant mapping check]
[kosaki.motohiro@jp.fujitsu.com: unevictable-lru-infrastructure: putback_lru_page()/unevictable page handling rework]
[kosaki.motohiro@jp.fujitsu.com: kill unnecessary lock_page() in vmscan.c]
[kosaki.motohiro@jp.fujitsu.com: revert migration change of unevictable lru infrastructure]
[kosaki.motohiro@jp.fujitsu.com: revert to unevictable-lru-infrastructure-kconfig-fix.patch]
[kosaki.motohiro@jp.fujitsu.com: restore patch failure of vmstat-unevictable-and-mlocked-pages-vm-events.patch]
Signed-off-by: Lee Schermerhorn <lee.schermerhorn@hp.com>
Signed-off-by: Rik van Riel <riel@redhat.com>
Signed-off-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Debugged-by: Benjamin Kidwell <benjkidwell@yahoo.com>
Signed-off-by: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp>
Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
											
										 
											2008-10-18 20:26:39 -07:00
										 |  |  | 		if (!isolate_lru_page(page)) { | 
					
						
							|  |  |  | 			put_page(page); | 
					
						
							|  |  |  | 			goto redo; | 
					
						
							|  |  |  | 		} | 
					
						
							|  |  |  | 		/* This means someone else dropped this page from LRU
 | 
					
						
							|  |  |  | 		 * So, it will be freed or putback to LRU again. There is | 
					
						
							|  |  |  | 		 * nothing to do here. | 
					
						
							|  |  |  | 		 */ | 
					
						
							|  |  |  | 	} | 
					
						
							|  |  |  | 
 | 
					
						
							| 
									
										
										
										
											2013-09-11 14:22:26 -07:00
										 |  |  | 	if (was_unevictable && !is_unevictable) | 
					
						
							| 
									
										
										
										
											2008-10-18 20:26:40 -07:00
										 |  |  | 		count_vm_event(UNEVICTABLE_PGRESCUED); | 
					
						
							| 
									
										
										
										
											2013-09-11 14:22:26 -07:00
										 |  |  | 	else if (!was_unevictable && is_unevictable) | 
					
						
							| 
									
										
										
										
											2008-10-18 20:26:40 -07:00
										 |  |  | 		count_vm_event(UNEVICTABLE_PGCULLED); | 
					
						
							|  |  |  | 
 | 
					
						
							| 
									
										
											  
											
												Unevictable LRU Infrastructure
When the system contains lots of mlocked or otherwise unevictable pages,
the pageout code (kswapd) can spend lots of time scanning over these
pages.  Worse still, the presence of lots of unevictable pages can confuse
kswapd into thinking that more aggressive pageout modes are required,
resulting in all kinds of bad behaviour.
Infrastructure to manage pages excluded from reclaim--i.e., hidden from
vmscan.  Based on a patch by Larry Woodman of Red Hat.  Reworked to
maintain "unevictable" pages on a separate per-zone LRU list, to "hide"
them from vmscan.
Kosaki Motohiro added the support for the memory controller unevictable
lru list.
Pages on the unevictable list have both PG_unevictable and PG_lru set.
Thus, PG_unevictable is analogous to and mutually exclusive with
PG_active--it specifies which LRU list the page is on.
The unevictable infrastructure is enabled by a new mm Kconfig option
[CONFIG_]UNEVICTABLE_LRU.
A new function 'page_evictable(page, vma)' in vmscan.c tests whether or
not a page may be evictable.  Subsequent patches will add the various
!evictable tests.  We'll want to keep these tests light-weight for use in
shrink_active_list() and, possibly, the fault path.
To avoid races between tasks putting pages [back] onto an LRU list and
tasks that might be moving the page from non-evictable to evictable state,
the new function 'putback_lru_page()' -- inverse to 'isolate_lru_page()'
-- tests the "evictability" of a page after placing it on the LRU, before
dropping the reference.  If the page has become unevictable,
putback_lru_page() will redo the 'putback', thus moving the page to the
unevictable list.  This way, we avoid "stranding" evictable pages on the
unevictable list.
[akpm@linux-foundation.org: fix fallout from out-of-order merge]
[riel@redhat.com: fix UNEVICTABLE_LRU and !PROC_PAGE_MONITOR build]
[nishimura@mxp.nes.nec.co.jp: remove redundant mapping check]
[kosaki.motohiro@jp.fujitsu.com: unevictable-lru-infrastructure: putback_lru_page()/unevictable page handling rework]
[kosaki.motohiro@jp.fujitsu.com: kill unnecessary lock_page() in vmscan.c]
[kosaki.motohiro@jp.fujitsu.com: revert migration change of unevictable lru infrastructure]
[kosaki.motohiro@jp.fujitsu.com: revert to unevictable-lru-infrastructure-kconfig-fix.patch]
[kosaki.motohiro@jp.fujitsu.com: restore patch failure of vmstat-unevictable-and-mlocked-pages-vm-events.patch]
Signed-off-by: Lee Schermerhorn <lee.schermerhorn@hp.com>
Signed-off-by: Rik van Riel <riel@redhat.com>
Signed-off-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Debugged-by: Benjamin Kidwell <benjkidwell@yahoo.com>
Signed-off-by: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp>
Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
											
										 
											2008-10-18 20:26:39 -07:00
										 |  |  | 	put_page(page);		/* drop ref from isolate */ | 
					
						
							|  |  |  | } | 
					
						
							|  |  |  | 
 | 
					
						
							| 
									
										
											  
											
												vmscan: factor out page reference checks
The used-once mapped file page detection patchset.
It is meant to help workloads with large amounts of shortly used file
mappings, like rtorrent hashing a file or git when dealing with loose
objects (git gc on a bigger site?).
Right now, the VM activates referenced mapped file pages on first
encounter on the inactive list and it takes a full memory cycle to
reclaim them again.  When those pages dominate memory, the system
no longer has a meaningful notion of 'working set' and is required
to give up the active list to make reclaim progress.  Obviously,
this results in rather bad scanning latencies and the wrong pages
being reclaimed.
This patch makes the VM be more careful about activating mapped file
pages in the first place.  The minimum granted lifetime without
another memory access becomes an inactive list cycle instead of the
full memory cycle, which is more natural given the mentioned loads.
This test resembles a hashing rtorrent process.  Sequentially, 32MB
chunks of a file are mapped into memory, hashed (sha1) and unmapped
again.  While this happens, every 5 seconds a process is launched and
its execution time taken:
	python2.4 -c 'import pydoc'
	old: max=2.31s mean=1.26s (0.34)
	new: max=1.25s mean=0.32s (0.32)
	find /etc -type f
	old: max=2.52s mean=1.44s (0.43)
	new: max=1.92s mean=0.12s (0.17)
	vim -c ':quit'
	old: max=6.14s mean=4.03s (0.49)
	new: max=3.48s mean=2.41s (0.25)
	mplayer --help
	old: max=8.08s mean=5.74s (1.02)
	new: max=3.79s mean=1.32s (0.81)
	overall hash time (stdev):
	old: time=1192.30 (12.85) thruput=25.78mb/s (0.27)
	new: time=1060.27 (32.58) thruput=29.02mb/s (0.88) (-11%)
I also tested kernbench with regular IO streaming in the background to
see whether the delayed activation of frequently used mapped file
pages had a negative impact on performance in the presence of pressure
on the inactive list.  The patch made no significant difference in
timing, neither for kernbench nor for the streaming IO throughput.
The first patch submission raised concerns about the cost of the extra
faults for actually activated pages on machines that have no hardware
support for young page table entries.
I created an artificial worst case scenario on an ARM machine with
around 300MHz and 64MB of memory to figure out the dimensions
involved.  The test would mmap a file of 20MB, then
  1. touch all its pages to fault them in
  2. force one full scan cycle on the inactive file LRU
  -- old: mapping pages activated
  -- new: mapping pages inactive
  3. touch the mapping pages again
  -- old and new: fault exceptions to set the young bits
  4. force another full scan cycle on the inactive file LRU
  5. touch the mapping pages one last time
  -- new: fault exceptions to set the young bits
The test showed an overall increase of 6% in time over 100 iterations
of the above (old: ~212sec, new: ~225sec).  13 secs total overhead /
(100 * 5k pages), ignoring the execution time of the test itself,
makes for about 25us overhead for every page that gets actually
activated.  Note:
  1. File mapping the size of one third of main memory, _completely_
  in active use across memory pressure - i.e., most pages referenced
  within one LRU cycle.  This should be rare to non-existant,
  especially on such embedded setups.
  2. Many huge activation batches.  Those batches only occur when the
  working set fluctuates.  If it changes completely between every full
  LRU cycle, you have problematic reclaim overhead anyway.
  3. Access of activated pages at maximum speed: sequential loads from
  every single page without doing anything in between.  In reality,
  the extra faults will get distributed between actual operations on
  the data.
So even if a workload manages to get the VM into the situation of
activating a third of memory in one go on such a setup, it will take
2.2 seconds instead 2.1 without the patch.
Comparing the numbers (and my user-experience over several months),
I think this change is an overall improvement to the VM.
Patch 1 is only refactoring to break up that ugly compound conditional
in shrink_page_list() and make it easy to document and add new checks
in a readable fashion.
Patch 2 gets rid of the obsolete page_mapping_inuse().  It's not
strictly related to #3, but it was in the original submission and is a
net simplification, so I kept it.
Patch 3 implements used-once detection of mapped file pages.
This patch:
Moving the big conditional into its own predicate function makes the code
a bit easier to read and allows for better commenting on the checks
one-by-one.
This is just cleaning up, no semantics should have been changed.
Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
Reviewed-by: Rik van Riel <riel@redhat.com>
Cc: Minchan Kim <minchan.kim@gmail.com>
Cc: OSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Cc: Lee Schermerhorn <lee.schermerhorn@hp.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
											
										 
											2010-03-05 13:42:19 -08:00
										 |  |  | enum page_references { | 
					
						
							|  |  |  | 	PAGEREF_RECLAIM, | 
					
						
							|  |  |  | 	PAGEREF_RECLAIM_CLEAN, | 
					
						
							| 
									
										
										
										
											2010-03-05 13:42:22 -08:00
										 |  |  | 	PAGEREF_KEEP, | 
					
						
							| 
									
										
											  
											
												vmscan: factor out page reference checks
The used-once mapped file page detection patchset.
It is meant to help workloads with large amounts of shortly used file
mappings, like rtorrent hashing a file or git when dealing with loose
objects (git gc on a bigger site?).
Right now, the VM activates referenced mapped file pages on first
encounter on the inactive list and it takes a full memory cycle to
reclaim them again.  When those pages dominate memory, the system
no longer has a meaningful notion of 'working set' and is required
to give up the active list to make reclaim progress.  Obviously,
this results in rather bad scanning latencies and the wrong pages
being reclaimed.
This patch makes the VM be more careful about activating mapped file
pages in the first place.  The minimum granted lifetime without
another memory access becomes an inactive list cycle instead of the
full memory cycle, which is more natural given the mentioned loads.
This test resembles a hashing rtorrent process.  Sequentially, 32MB
chunks of a file are mapped into memory, hashed (sha1) and unmapped
again.  While this happens, every 5 seconds a process is launched and
its execution time taken:
	python2.4 -c 'import pydoc'
	old: max=2.31s mean=1.26s (0.34)
	new: max=1.25s mean=0.32s (0.32)
	find /etc -type f
	old: max=2.52s mean=1.44s (0.43)
	new: max=1.92s mean=0.12s (0.17)
	vim -c ':quit'
	old: max=6.14s mean=4.03s (0.49)
	new: max=3.48s mean=2.41s (0.25)
	mplayer --help
	old: max=8.08s mean=5.74s (1.02)
	new: max=3.79s mean=1.32s (0.81)
	overall hash time (stdev):
	old: time=1192.30 (12.85) thruput=25.78mb/s (0.27)
	new: time=1060.27 (32.58) thruput=29.02mb/s (0.88) (-11%)
I also tested kernbench with regular IO streaming in the background to
see whether the delayed activation of frequently used mapped file
pages had a negative impact on performance in the presence of pressure
on the inactive list.  The patch made no significant difference in
timing, neither for kernbench nor for the streaming IO throughput.
The first patch submission raised concerns about the cost of the extra
faults for actually activated pages on machines that have no hardware
support for young page table entries.
I created an artificial worst case scenario on an ARM machine with
around 300MHz and 64MB of memory to figure out the dimensions
involved.  The test would mmap a file of 20MB, then
  1. touch all its pages to fault them in
  2. force one full scan cycle on the inactive file LRU
  -- old: mapping pages activated
  -- new: mapping pages inactive
  3. touch the mapping pages again
  -- old and new: fault exceptions to set the young bits
  4. force another full scan cycle on the inactive file LRU
  5. touch the mapping pages one last time
  -- new: fault exceptions to set the young bits
The test showed an overall increase of 6% in time over 100 iterations
of the above (old: ~212sec, new: ~225sec).  13 secs total overhead /
(100 * 5k pages), ignoring the execution time of the test itself,
makes for about 25us overhead for every page that gets actually
activated.  Note:
  1. File mapping the size of one third of main memory, _completely_
  in active use across memory pressure - i.e., most pages referenced
  within one LRU cycle.  This should be rare to non-existant,
  especially on such embedded setups.
  2. Many huge activation batches.  Those batches only occur when the
  working set fluctuates.  If it changes completely between every full
  LRU cycle, you have problematic reclaim overhead anyway.
  3. Access of activated pages at maximum speed: sequential loads from
  every single page without doing anything in between.  In reality,
  the extra faults will get distributed between actual operations on
  the data.
So even if a workload manages to get the VM into the situation of
activating a third of memory in one go on such a setup, it will take
2.2 seconds instead 2.1 without the patch.
Comparing the numbers (and my user-experience over several months),
I think this change is an overall improvement to the VM.
Patch 1 is only refactoring to break up that ugly compound conditional
in shrink_page_list() and make it easy to document and add new checks
in a readable fashion.
Patch 2 gets rid of the obsolete page_mapping_inuse().  It's not
strictly related to #3, but it was in the original submission and is a
net simplification, so I kept it.
Patch 3 implements used-once detection of mapped file pages.
This patch:
Moving the big conditional into its own predicate function makes the code
a bit easier to read and allows for better commenting on the checks
one-by-one.
This is just cleaning up, no semantics should have been changed.
Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
Reviewed-by: Rik van Riel <riel@redhat.com>
Cc: Minchan Kim <minchan.kim@gmail.com>
Cc: OSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Cc: Lee Schermerhorn <lee.schermerhorn@hp.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
											
										 
											2010-03-05 13:42:19 -08:00
										 |  |  | 	PAGEREF_ACTIVATE, | 
					
						
							|  |  |  | }; | 
					
						
							|  |  |  | 
 | 
					
						
							|  |  |  | static enum page_references page_check_references(struct page *page, | 
					
						
							|  |  |  | 						  struct scan_control *sc) | 
					
						
							|  |  |  | { | 
					
						
							| 
									
										
										
										
											2010-03-05 13:42:22 -08:00
										 |  |  | 	int referenced_ptes, referenced_page; | 
					
						
							| 
									
										
											  
											
												vmscan: factor out page reference checks
The used-once mapped file page detection patchset.
It is meant to help workloads with large amounts of shortly used file
mappings, like rtorrent hashing a file or git when dealing with loose
objects (git gc on a bigger site?).
Right now, the VM activates referenced mapped file pages on first
encounter on the inactive list and it takes a full memory cycle to
reclaim them again.  When those pages dominate memory, the system
no longer has a meaningful notion of 'working set' and is required
to give up the active list to make reclaim progress.  Obviously,
this results in rather bad scanning latencies and the wrong pages
being reclaimed.
This patch makes the VM be more careful about activating mapped file
pages in the first place.  The minimum granted lifetime without
another memory access becomes an inactive list cycle instead of the
full memory cycle, which is more natural given the mentioned loads.
This test resembles a hashing rtorrent process.  Sequentially, 32MB
chunks of a file are mapped into memory, hashed (sha1) and unmapped
again.  While this happens, every 5 seconds a process is launched and
its execution time taken:
	python2.4 -c 'import pydoc'
	old: max=2.31s mean=1.26s (0.34)
	new: max=1.25s mean=0.32s (0.32)
	find /etc -type f
	old: max=2.52s mean=1.44s (0.43)
	new: max=1.92s mean=0.12s (0.17)
	vim -c ':quit'
	old: max=6.14s mean=4.03s (0.49)
	new: max=3.48s mean=2.41s (0.25)
	mplayer --help
	old: max=8.08s mean=5.74s (1.02)
	new: max=3.79s mean=1.32s (0.81)
	overall hash time (stdev):
	old: time=1192.30 (12.85) thruput=25.78mb/s (0.27)
	new: time=1060.27 (32.58) thruput=29.02mb/s (0.88) (-11%)
I also tested kernbench with regular IO streaming in the background to
see whether the delayed activation of frequently used mapped file
pages had a negative impact on performance in the presence of pressure
on the inactive list.  The patch made no significant difference in
timing, neither for kernbench nor for the streaming IO throughput.
The first patch submission raised concerns about the cost of the extra
faults for actually activated pages on machines that have no hardware
support for young page table entries.
I created an artificial worst case scenario on an ARM machine with
around 300MHz and 64MB of memory to figure out the dimensions
involved.  The test would mmap a file of 20MB, then
  1. touch all its pages to fault them in
  2. force one full scan cycle on the inactive file LRU
  -- old: mapping pages activated
  -- new: mapping pages inactive
  3. touch the mapping pages again
  -- old and new: fault exceptions to set the young bits
  4. force another full scan cycle on the inactive file LRU
  5. touch the mapping pages one last time
  -- new: fault exceptions to set the young bits
The test showed an overall increase of 6% in time over 100 iterations
of the above (old: ~212sec, new: ~225sec).  13 secs total overhead /
(100 * 5k pages), ignoring the execution time of the test itself,
makes for about 25us overhead for every page that gets actually
activated.  Note:
  1. File mapping the size of one third of main memory, _completely_
  in active use across memory pressure - i.e., most pages referenced
  within one LRU cycle.  This should be rare to non-existant,
  especially on such embedded setups.
  2. Many huge activation batches.  Those batches only occur when the
  working set fluctuates.  If it changes completely between every full
  LRU cycle, you have problematic reclaim overhead anyway.
  3. Access of activated pages at maximum speed: sequential loads from
  every single page without doing anything in between.  In reality,
  the extra faults will get distributed between actual operations on
  the data.
So even if a workload manages to get the VM into the situation of
activating a third of memory in one go on such a setup, it will take
2.2 seconds instead 2.1 without the patch.
Comparing the numbers (and my user-experience over several months),
I think this change is an overall improvement to the VM.
Patch 1 is only refactoring to break up that ugly compound conditional
in shrink_page_list() and make it easy to document and add new checks
in a readable fashion.
Patch 2 gets rid of the obsolete page_mapping_inuse().  It's not
strictly related to #3, but it was in the original submission and is a
net simplification, so I kept it.
Patch 3 implements used-once detection of mapped file pages.
This patch:
Moving the big conditional into its own predicate function makes the code
a bit easier to read and allows for better commenting on the checks
one-by-one.
This is just cleaning up, no semantics should have been changed.
Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
Reviewed-by: Rik van Riel <riel@redhat.com>
Cc: Minchan Kim <minchan.kim@gmail.com>
Cc: OSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Cc: Lee Schermerhorn <lee.schermerhorn@hp.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
											
										 
											2010-03-05 13:42:19 -08:00
										 |  |  | 	unsigned long vm_flags; | 
					
						
							|  |  |  | 
 | 
					
						
							| 
									
										
										
										
											2012-05-29 15:06:25 -07:00
										 |  |  | 	referenced_ptes = page_referenced(page, 1, sc->target_mem_cgroup, | 
					
						
							|  |  |  | 					  &vm_flags); | 
					
						
							| 
									
										
										
										
											2010-03-05 13:42:22 -08:00
										 |  |  | 	referenced_page = TestClearPageReferenced(page); | 
					
						
							| 
									
										
											  
											
												vmscan: factor out page reference checks
The used-once mapped file page detection patchset.
It is meant to help workloads with large amounts of shortly used file
mappings, like rtorrent hashing a file or git when dealing with loose
objects (git gc on a bigger site?).
Right now, the VM activates referenced mapped file pages on first
encounter on the inactive list and it takes a full memory cycle to
reclaim them again.  When those pages dominate memory, the system
no longer has a meaningful notion of 'working set' and is required
to give up the active list to make reclaim progress.  Obviously,
this results in rather bad scanning latencies and the wrong pages
being reclaimed.
This patch makes the VM be more careful about activating mapped file
pages in the first place.  The minimum granted lifetime without
another memory access becomes an inactive list cycle instead of the
full memory cycle, which is more natural given the mentioned loads.
This test resembles a hashing rtorrent process.  Sequentially, 32MB
chunks of a file are mapped into memory, hashed (sha1) and unmapped
again.  While this happens, every 5 seconds a process is launched and
its execution time taken:
	python2.4 -c 'import pydoc'
	old: max=2.31s mean=1.26s (0.34)
	new: max=1.25s mean=0.32s (0.32)
	find /etc -type f
	old: max=2.52s mean=1.44s (0.43)
	new: max=1.92s mean=0.12s (0.17)
	vim -c ':quit'
	old: max=6.14s mean=4.03s (0.49)
	new: max=3.48s mean=2.41s (0.25)
	mplayer --help
	old: max=8.08s mean=5.74s (1.02)
	new: max=3.79s mean=1.32s (0.81)
	overall hash time (stdev):
	old: time=1192.30 (12.85) thruput=25.78mb/s (0.27)
	new: time=1060.27 (32.58) thruput=29.02mb/s (0.88) (-11%)
I also tested kernbench with regular IO streaming in the background to
see whether the delayed activation of frequently used mapped file
pages had a negative impact on performance in the presence of pressure
on the inactive list.  The patch made no significant difference in
timing, neither for kernbench nor for the streaming IO throughput.
The first patch submission raised concerns about the cost of the extra
faults for actually activated pages on machines that have no hardware
support for young page table entries.
I created an artificial worst case scenario on an ARM machine with
around 300MHz and 64MB of memory to figure out the dimensions
involved.  The test would mmap a file of 20MB, then
  1. touch all its pages to fault them in
  2. force one full scan cycle on the inactive file LRU
  -- old: mapping pages activated
  -- new: mapping pages inactive
  3. touch the mapping pages again
  -- old and new: fault exceptions to set the young bits
  4. force another full scan cycle on the inactive file LRU
  5. touch the mapping pages one last time
  -- new: fault exceptions to set the young bits
The test showed an overall increase of 6% in time over 100 iterations
of the above (old: ~212sec, new: ~225sec).  13 secs total overhead /
(100 * 5k pages), ignoring the execution time of the test itself,
makes for about 25us overhead for every page that gets actually
activated.  Note:
  1. File mapping the size of one third of main memory, _completely_
  in active use across memory pressure - i.e., most pages referenced
  within one LRU cycle.  This should be rare to non-existant,
  especially on such embedded setups.
  2. Many huge activation batches.  Those batches only occur when the
  working set fluctuates.  If it changes completely between every full
  LRU cycle, you have problematic reclaim overhead anyway.
  3. Access of activated pages at maximum speed: sequential loads from
  every single page without doing anything in between.  In reality,
  the extra faults will get distributed between actual operations on
  the data.
So even if a workload manages to get the VM into the situation of
activating a third of memory in one go on such a setup, it will take
2.2 seconds instead 2.1 without the patch.
Comparing the numbers (and my user-experience over several months),
I think this change is an overall improvement to the VM.
Patch 1 is only refactoring to break up that ugly compound conditional
in shrink_page_list() and make it easy to document and add new checks
in a readable fashion.
Patch 2 gets rid of the obsolete page_mapping_inuse().  It's not
strictly related to #3, but it was in the original submission and is a
net simplification, so I kept it.
Patch 3 implements used-once detection of mapped file pages.
This patch:
Moving the big conditional into its own predicate function makes the code
a bit easier to read and allows for better commenting on the checks
one-by-one.
This is just cleaning up, no semantics should have been changed.
Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
Reviewed-by: Rik van Riel <riel@redhat.com>
Cc: Minchan Kim <minchan.kim@gmail.com>
Cc: OSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Cc: Lee Schermerhorn <lee.schermerhorn@hp.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
											
										 
											2010-03-05 13:42:19 -08:00
										 |  |  | 
 | 
					
						
							|  |  |  | 	/*
 | 
					
						
							|  |  |  | 	 * Mlock lost the isolation race with us.  Let try_to_unmap() | 
					
						
							|  |  |  | 	 * move the page to the unevictable list. | 
					
						
							|  |  |  | 	 */ | 
					
						
							|  |  |  | 	if (vm_flags & VM_LOCKED) | 
					
						
							|  |  |  | 		return PAGEREF_RECLAIM; | 
					
						
							|  |  |  | 
 | 
					
						
							| 
									
										
										
										
											2010-03-05 13:42:22 -08:00
										 |  |  | 	if (referenced_ptes) { | 
					
						
							| 
									
										
										
										
											2012-05-29 15:06:45 -07:00
										 |  |  | 		if (PageSwapBacked(page)) | 
					
						
							| 
									
										
										
										
											2010-03-05 13:42:22 -08:00
										 |  |  | 			return PAGEREF_ACTIVATE; | 
					
						
							|  |  |  | 		/*
 | 
					
						
							|  |  |  | 		 * All mapped pages start out with page table | 
					
						
							|  |  |  | 		 * references from the instantiating fault, so we need | 
					
						
							|  |  |  | 		 * to look twice if a mapped file page is used more | 
					
						
							|  |  |  | 		 * than once. | 
					
						
							|  |  |  | 		 * | 
					
						
							|  |  |  | 		 * Mark it and spare it for another trip around the | 
					
						
							|  |  |  | 		 * inactive list.  Another page table reference will | 
					
						
							|  |  |  | 		 * lead to its activation. | 
					
						
							|  |  |  | 		 * | 
					
						
							|  |  |  | 		 * Note: the mark is set for activated pages as well | 
					
						
							|  |  |  | 		 * so that recently deactivated but used pages are | 
					
						
							|  |  |  | 		 * quickly recovered. | 
					
						
							|  |  |  | 		 */ | 
					
						
							|  |  |  | 		SetPageReferenced(page); | 
					
						
							|  |  |  | 
 | 
					
						
							| 
									
										
										
										
											2012-01-10 15:06:59 -08:00
										 |  |  | 		if (referenced_page || referenced_ptes > 1) | 
					
						
							| 
									
										
										
										
											2010-03-05 13:42:22 -08:00
										 |  |  | 			return PAGEREF_ACTIVATE; | 
					
						
							|  |  |  | 
 | 
					
						
							| 
									
										
										
										
											2012-01-10 15:07:03 -08:00
										 |  |  | 		/*
 | 
					
						
							|  |  |  | 		 * Activate file-backed executable pages after first usage. | 
					
						
							|  |  |  | 		 */ | 
					
						
							|  |  |  | 		if (vm_flags & VM_EXEC) | 
					
						
							|  |  |  | 			return PAGEREF_ACTIVATE; | 
					
						
							|  |  |  | 
 | 
					
						
							| 
									
										
										
										
											2010-03-05 13:42:22 -08:00
										 |  |  | 		return PAGEREF_KEEP; | 
					
						
							|  |  |  | 	} | 
					
						
							| 
									
										
											  
											
												vmscan: factor out page reference checks
The used-once mapped file page detection patchset.
It is meant to help workloads with large amounts of shortly used file
mappings, like rtorrent hashing a file or git when dealing with loose
objects (git gc on a bigger site?).
Right now, the VM activates referenced mapped file pages on first
encounter on the inactive list and it takes a full memory cycle to
reclaim them again.  When those pages dominate memory, the system
no longer has a meaningful notion of 'working set' and is required
to give up the active list to make reclaim progress.  Obviously,
this results in rather bad scanning latencies and the wrong pages
being reclaimed.
This patch makes the VM be more careful about activating mapped file
pages in the first place.  The minimum granted lifetime without
another memory access becomes an inactive list cycle instead of the
full memory cycle, which is more natural given the mentioned loads.
This test resembles a hashing rtorrent process.  Sequentially, 32MB
chunks of a file are mapped into memory, hashed (sha1) and unmapped
again.  While this happens, every 5 seconds a process is launched and
its execution time taken:
	python2.4 -c 'import pydoc'
	old: max=2.31s mean=1.26s (0.34)
	new: max=1.25s mean=0.32s (0.32)
	find /etc -type f
	old: max=2.52s mean=1.44s (0.43)
	new: max=1.92s mean=0.12s (0.17)
	vim -c ':quit'
	old: max=6.14s mean=4.03s (0.49)
	new: max=3.48s mean=2.41s (0.25)
	mplayer --help
	old: max=8.08s mean=5.74s (1.02)
	new: max=3.79s mean=1.32s (0.81)
	overall hash time (stdev):
	old: time=1192.30 (12.85) thruput=25.78mb/s (0.27)
	new: time=1060.27 (32.58) thruput=29.02mb/s (0.88) (-11%)
I also tested kernbench with regular IO streaming in the background to
see whether the delayed activation of frequently used mapped file
pages had a negative impact on performance in the presence of pressure
on the inactive list.  The patch made no significant difference in
timing, neither for kernbench nor for the streaming IO throughput.
The first patch submission raised concerns about the cost of the extra
faults for actually activated pages on machines that have no hardware
support for young page table entries.
I created an artificial worst case scenario on an ARM machine with
around 300MHz and 64MB of memory to figure out the dimensions
involved.  The test would mmap a file of 20MB, then
  1. touch all its pages to fault them in
  2. force one full scan cycle on the inactive file LRU
  -- old: mapping pages activated
  -- new: mapping pages inactive
  3. touch the mapping pages again
  -- old and new: fault exceptions to set the young bits
  4. force another full scan cycle on the inactive file LRU
  5. touch the mapping pages one last time
  -- new: fault exceptions to set the young bits
The test showed an overall increase of 6% in time over 100 iterations
of the above (old: ~212sec, new: ~225sec).  13 secs total overhead /
(100 * 5k pages), ignoring the execution time of the test itself,
makes for about 25us overhead for every page that gets actually
activated.  Note:
  1. File mapping the size of one third of main memory, _completely_
  in active use across memory pressure - i.e., most pages referenced
  within one LRU cycle.  This should be rare to non-existant,
  especially on such embedded setups.
  2. Many huge activation batches.  Those batches only occur when the
  working set fluctuates.  If it changes completely between every full
  LRU cycle, you have problematic reclaim overhead anyway.
  3. Access of activated pages at maximum speed: sequential loads from
  every single page without doing anything in between.  In reality,
  the extra faults will get distributed between actual operations on
  the data.
So even if a workload manages to get the VM into the situation of
activating a third of memory in one go on such a setup, it will take
2.2 seconds instead 2.1 without the patch.
Comparing the numbers (and my user-experience over several months),
I think this change is an overall improvement to the VM.
Patch 1 is only refactoring to break up that ugly compound conditional
in shrink_page_list() and make it easy to document and add new checks
in a readable fashion.
Patch 2 gets rid of the obsolete page_mapping_inuse().  It's not
strictly related to #3, but it was in the original submission and is a
net simplification, so I kept it.
Patch 3 implements used-once detection of mapped file pages.
This patch:
Moving the big conditional into its own predicate function makes the code
a bit easier to read and allows for better commenting on the checks
one-by-one.
This is just cleaning up, no semantics should have been changed.
Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
Reviewed-by: Rik van Riel <riel@redhat.com>
Cc: Minchan Kim <minchan.kim@gmail.com>
Cc: OSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Cc: Lee Schermerhorn <lee.schermerhorn@hp.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
											
										 
											2010-03-05 13:42:19 -08:00
										 |  |  | 
 | 
					
						
							|  |  |  | 	/* Reclaim if clean, defer dirty pages to writeback */ | 
					
						
							| 
									
										
										
										
											2010-10-26 14:21:46 -07:00
										 |  |  | 	if (referenced_page && !PageSwapBacked(page)) | 
					
						
							| 
									
										
										
										
											2010-03-05 13:42:22 -08:00
										 |  |  | 		return PAGEREF_RECLAIM_CLEAN; | 
					
						
							|  |  |  | 
 | 
					
						
							|  |  |  | 	return PAGEREF_RECLAIM; | 
					
						
							| 
									
										
											  
											
												vmscan: factor out page reference checks
The used-once mapped file page detection patchset.
It is meant to help workloads with large amounts of shortly used file
mappings, like rtorrent hashing a file or git when dealing with loose
objects (git gc on a bigger site?).
Right now, the VM activates referenced mapped file pages on first
encounter on the inactive list and it takes a full memory cycle to
reclaim them again.  When those pages dominate memory, the system
no longer has a meaningful notion of 'working set' and is required
to give up the active list to make reclaim progress.  Obviously,
this results in rather bad scanning latencies and the wrong pages
being reclaimed.
This patch makes the VM be more careful about activating mapped file
pages in the first place.  The minimum granted lifetime without
another memory access becomes an inactive list cycle instead of the
full memory cycle, which is more natural given the mentioned loads.
This test resembles a hashing rtorrent process.  Sequentially, 32MB
chunks of a file are mapped into memory, hashed (sha1) and unmapped
again.  While this happens, every 5 seconds a process is launched and
its execution time taken:
	python2.4 -c 'import pydoc'
	old: max=2.31s mean=1.26s (0.34)
	new: max=1.25s mean=0.32s (0.32)
	find /etc -type f
	old: max=2.52s mean=1.44s (0.43)
	new: max=1.92s mean=0.12s (0.17)
	vim -c ':quit'
	old: max=6.14s mean=4.03s (0.49)
	new: max=3.48s mean=2.41s (0.25)
	mplayer --help
	old: max=8.08s mean=5.74s (1.02)
	new: max=3.79s mean=1.32s (0.81)
	overall hash time (stdev):
	old: time=1192.30 (12.85) thruput=25.78mb/s (0.27)
	new: time=1060.27 (32.58) thruput=29.02mb/s (0.88) (-11%)
I also tested kernbench with regular IO streaming in the background to
see whether the delayed activation of frequently used mapped file
pages had a negative impact on performance in the presence of pressure
on the inactive list.  The patch made no significant difference in
timing, neither for kernbench nor for the streaming IO throughput.
The first patch submission raised concerns about the cost of the extra
faults for actually activated pages on machines that have no hardware
support for young page table entries.
I created an artificial worst case scenario on an ARM machine with
around 300MHz and 64MB of memory to figure out the dimensions
involved.  The test would mmap a file of 20MB, then
  1. touch all its pages to fault them in
  2. force one full scan cycle on the inactive file LRU
  -- old: mapping pages activated
  -- new: mapping pages inactive
  3. touch the mapping pages again
  -- old and new: fault exceptions to set the young bits
  4. force another full scan cycle on the inactive file LRU
  5. touch the mapping pages one last time
  -- new: fault exceptions to set the young bits
The test showed an overall increase of 6% in time over 100 iterations
of the above (old: ~212sec, new: ~225sec).  13 secs total overhead /
(100 * 5k pages), ignoring the execution time of the test itself,
makes for about 25us overhead for every page that gets actually
activated.  Note:
  1. File mapping the size of one third of main memory, _completely_
  in active use across memory pressure - i.e., most pages referenced
  within one LRU cycle.  This should be rare to non-existant,
  especially on such embedded setups.
  2. Many huge activation batches.  Those batches only occur when the
  working set fluctuates.  If it changes completely between every full
  LRU cycle, you have problematic reclaim overhead anyway.
  3. Access of activated pages at maximum speed: sequential loads from
  every single page without doing anything in between.  In reality,
  the extra faults will get distributed between actual operations on
  the data.
So even if a workload manages to get the VM into the situation of
activating a third of memory in one go on such a setup, it will take
2.2 seconds instead 2.1 without the patch.
Comparing the numbers (and my user-experience over several months),
I think this change is an overall improvement to the VM.
Patch 1 is only refactoring to break up that ugly compound conditional
in shrink_page_list() and make it easy to document and add new checks
in a readable fashion.
Patch 2 gets rid of the obsolete page_mapping_inuse().  It's not
strictly related to #3, but it was in the original submission and is a
net simplification, so I kept it.
Patch 3 implements used-once detection of mapped file pages.
This patch:
Moving the big conditional into its own predicate function makes the code
a bit easier to read and allows for better commenting on the checks
one-by-one.
This is just cleaning up, no semantics should have been changed.
Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
Reviewed-by: Rik van Riel <riel@redhat.com>
Cc: Minchan Kim <minchan.kim@gmail.com>
Cc: OSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Cc: Lee Schermerhorn <lee.schermerhorn@hp.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
											
										 
											2010-03-05 13:42:19 -08:00
										 |  |  | } | 
					
						
							|  |  |  | 
 | 
					
						
							| 
									
										
											  
											
												mm: vmscan: stall page reclaim and writeback pages based on dirty/writepage pages encountered
Further testing of the "Reduce system disruption due to kswapd"
discovered a few problems.  First and foremost, it's possible for pages
under writeback to be freed which will lead to badness.  Second, as
pages were not being swapped the file LRU was being scanned faster and
clean file pages were being reclaimed.  In some cases this results in
increased read IO to re-read data from disk.  Third, more pages were
being written from kswapd context which can adversly affect IO
performance.  Lastly, it was observed that PageDirty pages are not
necessarily dirty on all filesystems (buffers can be clean while
PageDirty is set and ->writepage generates no IO) and not all
filesystems set PageWriteback when the page is being written (e.g.
ext3).  This disconnect confuses the reclaim stalling logic.  This
follow-up series is aimed at these problems.
The tests were based on three kernels
vanilla:	kernel 3.9 as that is what the current mmotm uses as a baseline
mmotm-20130522	is mmotm as of 22nd May with "Reduce system disruption due to
		kswapd" applied on top as per what should be in Andrew's tree
		right now
lessdisrupt-v7r10 is this follow-up series on top of the mmotm kernel
The first test used memcached+memcachetest while some background IO was
in progress as implemented by the parallel IO tests implement in MM
Tests.  memcachetest benchmarks how many operations/second memcached can
service.  It starts with no background IO on a freshly created ext4
filesystem and then re-runs the test with larger amounts of IO in the
background to roughly simulate a large copy in progress.  The
expectation is that the IO should have little or no impact on
memcachetest which is running entirely in memory.
parallelio
                                             3.9.0                       3.9.0                       3.9.0
                                           vanilla          mm1-mmotm-20130522       mm1-lessdisrupt-v7r10
Ops memcachetest-0M             23117.00 (  0.00%)          22780.00 ( -1.46%)          22763.00 ( -1.53%)
Ops memcachetest-715M           23774.00 (  0.00%)          23299.00 ( -2.00%)          22934.00 ( -3.53%)
Ops memcachetest-2385M           4208.00 (  0.00%)          24154.00 (474.00%)          23765.00 (464.76%)
Ops memcachetest-4055M           4104.00 (  0.00%)          25130.00 (512.33%)          24614.00 (499.76%)
Ops io-duration-0M                  0.00 (  0.00%)              0.00 (  0.00%)              0.00 (  0.00%)
Ops io-duration-715M               12.00 (  0.00%)              7.00 ( 41.67%)              6.00 ( 50.00%)
Ops io-duration-2385M             116.00 (  0.00%)             21.00 ( 81.90%)             21.00 ( 81.90%)
Ops io-duration-4055M             160.00 (  0.00%)             36.00 ( 77.50%)             35.00 ( 78.12%)
Ops swaptotal-0M                    0.00 (  0.00%)              0.00 (  0.00%)              0.00 (  0.00%)
Ops swaptotal-715M             140138.00 (  0.00%)             18.00 ( 99.99%)             18.00 ( 99.99%)
Ops swaptotal-2385M            385682.00 (  0.00%)              0.00 (  0.00%)              0.00 (  0.00%)
Ops swaptotal-4055M            418029.00 (  0.00%)              0.00 (  0.00%)              0.00 (  0.00%)
Ops swapin-0M                       0.00 (  0.00%)              0.00 (  0.00%)              0.00 (  0.00%)
Ops swapin-715M                   144.00 (  0.00%)              0.00 (  0.00%)              0.00 (  0.00%)
Ops swapin-2385M               134227.00 (  0.00%)              0.00 (  0.00%)              0.00 (  0.00%)
Ops swapin-4055M               125618.00 (  0.00%)              0.00 (  0.00%)              0.00 (  0.00%)
Ops minorfaults-0M            1536429.00 (  0.00%)        1531632.00 (  0.31%)        1533541.00 (  0.19%)
Ops minorfaults-715M          1786996.00 (  0.00%)        1612148.00 (  9.78%)        1608832.00 (  9.97%)
Ops minorfaults-2385M         1757952.00 (  0.00%)        1614874.00 (  8.14%)        1613541.00 (  8.21%)
Ops minorfaults-4055M         1774460.00 (  0.00%)        1633400.00 (  7.95%)        1630881.00 (  8.09%)
Ops majorfaults-0M                  1.00 (  0.00%)              0.00 (  0.00%)              0.00 (  0.00%)
Ops majorfaults-715M              184.00 (  0.00%)            167.00 (  9.24%)            166.00 (  9.78%)
Ops majorfaults-2385M           24444.00 (  0.00%)            155.00 ( 99.37%)             93.00 ( 99.62%)
Ops majorfaults-4055M           21357.00 (  0.00%)            147.00 ( 99.31%)            134.00 ( 99.37%)
memcachetest is the transactions/second reported by memcachetest. In
        the vanilla kernel note that performance drops from around
        23K/sec to just over 4K/second when there is 2385M of IO going
        on in the background. With current mmotm, there is no collapse
	in performance and with this follow-up series there is little
	change.
swaptotal is the total amount of swap traffic. With mmotm and the follow-up
	series, the total amount of swapping is much reduced.
                                 3.9.0       3.9.0       3.9.0
                               vanillamm1-mmotm-20130522mm1-lessdisrupt-v7r10
Minor Faults                  11160152    10706748    10622316
Major Faults                     46305         755         678
Swap Ins                        260249           0           0
Swap Outs                       683860          18          18
Direct pages scanned                 0         678        2520
Kswapd pages scanned           6046108     8814900     1639279
Kswapd pages reclaimed         1081954     1172267     1094635
Direct pages reclaimed               0         566        2304
Kswapd efficiency                  17%         13%         66%
Kswapd velocity               5217.560    7618.953    1414.879
Direct efficiency                 100%         83%         91%
Direct velocity                  0.000       0.586       2.175
Percentage direct scans             0%          0%          0%
Zone normal velocity          5105.086    6824.681     671.158
Zone dma32 velocity            112.473     794.858     745.896
Zone dma velocity                0.000       0.000       0.000
Page writes by reclaim     1929612.000 6861768.000   32821.000
Page writes file               1245752     6861750       32803
Page writes anon                683860          18          18
Page reclaim immediate            7484          40         239
Sector Reads                   1130320       93996       86900
Sector Writes                 13508052    10823500    11804436
Page rescued immediate               0           0           0
Slabs scanned                    33536       27136       18560
Direct inode steals                  0           0           0
Kswapd inode steals               8641        1035           0
Kswapd skipped wait                  0           0           0
THP fault alloc                      8          37          33
THP collapse alloc                 508         552         515
THP splits                          24           1           1
THP fault fallback                   0           0           0
THP collapse fail                    0           0           0
There are a number of observations to make here
1. Swap outs are almost eliminated. Swap ins are 0 indicating that the
   pages swapped were really unused anonymous pages. Related to that,
   major faults are much reduced.
2. kswapd efficiency was impacted by the initial series but with these
   follow-up patches, the efficiency is now at 66% indicating that far
   fewer pages were skipped during scanning due to dirty or writeback
   pages.
3. kswapd velocity is reduced indicating that fewer pages are being scanned
   with the follow-up series as kswapd now stalls when the tail of the
   LRU queue is full of unqueued dirty pages. The stall gives flushers a
   chance to catch-up so kswapd can reclaim clean pages when it wakes
4. In light of Zlatko's recent reports about zone scanning imbalances,
   mmtests now reports scanning velocity on a per-zone basis. With mainline,
   you can see that the scanning activity is dominated by the Normal
   zone with over 45 times more scanning in Normal than the DMA32 zone.
   With the series currently in mmotm, the ratio is slightly better but it
   is still the case that the bulk of scanning is in the highest zone. With
   this follow-up series, the ratio of scanning between the Normal and
   DMA32 zone is roughly equal.
5. As Dave Chinner observed, the current patches in mmotm increased the
   number of pages written from kswapd context which is expected to adversly
   impact IO performance. With the follow-up patches, far fewer pages are
   written from kswapd context than the mainline kernel
6. With the series in mmotm, fewer inodes were reclaimed by kswapd. With
   the follow-up series, there is less slab shrinking activity and no inodes
   were reclaimed.
7. Note that "Sectors Read" is drastically reduced implying that the source
   data being used for the IO is not being aggressively discarded due to
   page reclaim skipping over dirty pages and reclaiming clean pages. Note
   that the reducion in reads could also be due to inode data not being
   re-read from disk after a slab shrink.
                       3.9.0       3.9.0       3.9.0
                     vanillamm1-mmotm-20130522mm1-lessdisrupt-v7r10
Mean sda-avgqz        166.99       32.09       33.44
Mean sda-await        853.64      192.76      185.43
Mean sda-r_await        6.31        9.24        5.97
Mean sda-w_await     2992.81      202.65      192.43
Max  sda-avgqz       1409.91      718.75      698.98
Max  sda-await       6665.74     3538.00     3124.23
Max  sda-r_await       58.96      111.95       58.00
Max  sda-w_await    28458.94     3977.29     3148.61
In light of the changes in writes from reclaim context, the number of
reads and Dave Chinner's concerns about IO performance I took a closer
look at the IO stats for the test disk. Few observations
1. The average queue size is reduced by the initial series and roughly
   the same with this follow up.
2. Average wait times for writes are reduced and as the IO
   is completing faster it at least implies that the gain is because
   flushers are writing the files efficiently instead of page reclaim
   getting in the way.
3. The reduction in maximum write latency is staggering. 28 seconds down
   to 3 seconds.
Jan Kara asked how NFS is affected by all of this. Unstable pages can
be taken into account as one of the patches in the series shows but it
is still the case that filesystems with unusual handling of dirty or
writeback could still be treated better.
Tests like postmark, fsmark and largedd showed up nothing useful. On my test
setup, pages are simply not being written back from reclaim context with or
without the patches and there are no changes in performance. My test setup
probably is just not strong enough network-wise to be really interesting.
I ran a longer-lived memcached test with IO going to NFS instead of a local disk
parallelio
                                             3.9.0                       3.9.0                       3.9.0
                                           vanilla          mm1-mmotm-20130522       mm1-lessdisrupt-v7r10
Ops memcachetest-0M             23323.00 (  0.00%)          23241.00 ( -0.35%)          23321.00 ( -0.01%)
Ops memcachetest-715M           25526.00 (  0.00%)          24763.00 ( -2.99%)          23242.00 ( -8.95%)
Ops memcachetest-2385M           8814.00 (  0.00%)          26924.00 (205.47%)          23521.00 (166.86%)
Ops memcachetest-4055M           5835.00 (  0.00%)          26827.00 (359.76%)          25560.00 (338.05%)
Ops io-duration-0M                  0.00 (  0.00%)              0.00 (  0.00%)              0.00 (  0.00%)
Ops io-duration-715M               65.00 (  0.00%)             71.00 ( -9.23%)             11.00 ( 83.08%)
Ops io-duration-2385M             129.00 (  0.00%)             94.00 ( 27.13%)             53.00 ( 58.91%)
Ops io-duration-4055M             301.00 (  0.00%)            100.00 ( 66.78%)            108.00 ( 64.12%)
Ops swaptotal-0M                    0.00 (  0.00%)              0.00 (  0.00%)              0.00 (  0.00%)
Ops swaptotal-715M              14394.00 (  0.00%)            949.00 ( 93.41%)             63.00 ( 99.56%)
Ops swaptotal-2385M            401483.00 (  0.00%)          24437.00 ( 93.91%)          30118.00 ( 92.50%)
Ops swaptotal-4055M            554123.00 (  0.00%)          35688.00 ( 93.56%)          63082.00 ( 88.62%)
Ops swapin-0M                       0.00 (  0.00%)              0.00 (  0.00%)              0.00 (  0.00%)
Ops swapin-715M                  4522.00 (  0.00%)            560.00 ( 87.62%)             63.00 ( 98.61%)
Ops swapin-2385M               169861.00 (  0.00%)           5026.00 ( 97.04%)          13917.00 ( 91.81%)
Ops swapin-4055M               192374.00 (  0.00%)          10056.00 ( 94.77%)          25729.00 ( 86.63%)
Ops minorfaults-0M            1445969.00 (  0.00%)        1520878.00 ( -5.18%)        1454024.00 ( -0.56%)
Ops minorfaults-715M          1557288.00 (  0.00%)        1528482.00 (  1.85%)        1535776.00 (  1.38%)
Ops minorfaults-2385M         1692896.00 (  0.00%)        1570523.00 (  7.23%)        1559622.00 (  7.87%)
Ops minorfaults-4055M         1654985.00 (  0.00%)        1581456.00 (  4.44%)        1596713.00 (  3.52%)
Ops majorfaults-0M                  0.00 (  0.00%)              1.00 (-99.00%)              0.00 (  0.00%)
Ops majorfaults-715M              763.00 (  0.00%)            265.00 ( 65.27%)             75.00 ( 90.17%)
Ops majorfaults-2385M           23861.00 (  0.00%)            894.00 ( 96.25%)           2189.00 ( 90.83%)
Ops majorfaults-4055M           27210.00 (  0.00%)           1569.00 ( 94.23%)           4088.00 ( 84.98%)
1. Performance does not collapse due to IO which is good. IO is also completing
   faster. Note with mmotm, IO completes in a third of the time and faster again
   with this series applied
2. Swapping is reduced, although not eliminated. The figures for the follow-up
   look bad but it does vary a bit as the stalling is not perfect for nfs
   or filesystems like ext3 with unusual handling of dirty and writeback
   pages
3. There are swapins, particularly with larger amounts of IO indicating
   that active pages are being reclaimed. However, the number of much
   reduced.
                                 3.9.0       3.9.0       3.9.0
                               vanillamm1-mmotm-20130522mm1-lessdisrupt-v7r10
Minor Faults                  36339175    35025445    35219699
Major Faults                    310964       27108       51887
Swap Ins                       2176399      173069      333316
Swap Outs                      3344050      357228      504824
Direct pages scanned              8972       77283       43242
Kswapd pages scanned          20899983     8939566    14772851
Kswapd pages reclaimed         6193156     5172605     5231026
Direct pages reclaimed            8450       73802       39514
Kswapd efficiency                  29%         57%         35%
Kswapd velocity               3929.743    1847.499    3058.840
Direct efficiency                  94%         95%         91%
Direct velocity                  1.687      15.972       8.954
Percentage direct scans             0%          0%          0%
Zone normal velocity          3721.907     939.103    2185.142
Zone dma32 velocity            209.522     924.368     882.651
Zone dma velocity                0.000       0.000       0.000
Page writes by reclaim     4082185.000  526319.000  537114.000
Page writes file                738135      169091       32290
Page writes anon               3344050      357228      504824
Page reclaim immediate            9524         170     5595843
Sector Reads                   8909900      861192     1483680
Sector Writes                 13428980     1488744     2076800
Page rescued immediate               0           0           0
Slabs scanned                    38016       31744       28672
Direct inode steals                  0           0           0
Kswapd inode steals                424           0           0
Kswapd skipped wait                  0           0           0
THP fault alloc                     14          15         119
THP collapse alloc                1767        1569        1618
THP splits                          30          29          25
THP fault fallback                   0           0           0
THP collapse fail                    8           5           0
Compaction stalls                   17          41         100
Compaction success                   7          31          95
Compaction failures                 10          10           5
Page migrate success              7083       22157       62217
Page migrate failure                 0           0           0
Compaction pages isolated        14847       48758      135830
Compaction migrate scanned       18328       48398      138929
Compaction free scanned        2000255      355827     1720269
Compaction cost                      7          24          68
I guess the main takeaway again is the much reduced page writes
from reclaim context and reduced reads.
                       3.9.0       3.9.0       3.9.0
                     vanillamm1-mmotm-20130522mm1-lessdisrupt-v7r10
Mean sda-avgqz         23.58        0.35        0.44
Mean sda-await        133.47       15.72       15.46
Mean sda-r_await        4.72        4.69        3.95
Mean sda-w_await      507.69       28.40       33.68
Max  sda-avgqz        680.60       12.25       23.14
Max  sda-await       3958.89      221.83      286.22
Max  sda-r_await       63.86       61.23       67.29
Max  sda-w_await    11710.38      883.57     1767.28
And as before, write wait times are much reduced.
This patch:
The patch "mm: vmscan: Have kswapd writeback pages based on dirty pages
encountered, not priority" decides whether to writeback pages from reclaim
context based on the number of dirty pages encountered.  This situation is
flagged too easily and flushers are not given the chance to catch up
resulting in more pages being written from reclaim context and potentially
impacting IO performance.  The check for PageWriteback is also misplaced
as it happens within a PageDirty check which is nonsense as the dirty may
have been cleared for IO.  The accounting is updated very late and pages
that are already under writeback, were reactivated, could not unmapped or
could not be released are all missed.  Similarly, a page is considered
congested for reasons other than being congested and pages that cannot be
written out in the correct context are skipped.  Finally, it considers
stalling and writing back filesystem pages due to encountering dirty
anonymous pages at the tail of the LRU which is dumb.
This patch causes kswapd to begin writing filesystem pages from reclaim
context only if page reclaim found that all filesystem pages at the tail
of the LRU were unqueued dirty pages.  Before it starts writing filesystem
pages, it will stall to give flushers a chance to catch up.  The decision
on whether wait_iff_congested is also now determined by dirty filesystem
pages only.  Congested pages are based on whether the underlying BDI is
congested regardless of the context of the reclaiming process.
Signed-off-by: Mel Gorman <mgorman@suse.de>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Michal Hocko <mhocko@suse.cz>
Cc: Rik van Riel <riel@redhat.com>
Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Jiri Slaby <jslaby@suse.cz>
Cc: Valdis Kletnieks <Valdis.Kletnieks@vt.edu>
Cc: Zlatko Calusic <zcalusic@bitsync.net>
Cc: dormando <dormando@rydia.net>
Cc: Trond Myklebust <trond.myklebust@fys.uio.no>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
											
										 
											2013-07-03 15:01:57 -07:00
										 |  |  | /* Check if a page is dirty or under writeback */ | 
					
						
							|  |  |  | static void page_check_dirty_writeback(struct page *page, | 
					
						
							|  |  |  | 				       bool *dirty, bool *writeback) | 
					
						
							|  |  |  | { | 
					
						
							| 
									
										
										
										
											2013-07-03 15:02:05 -07:00
										 |  |  | 	struct address_space *mapping; | 
					
						
							|  |  |  | 
 | 
					
						
							| 
									
										
											  
											
												mm: vmscan: stall page reclaim and writeback pages based on dirty/writepage pages encountered
Further testing of the "Reduce system disruption due to kswapd"
discovered a few problems.  First and foremost, it's possible for pages
under writeback to be freed which will lead to badness.  Second, as
pages were not being swapped the file LRU was being scanned faster and
clean file pages were being reclaimed.  In some cases this results in
increased read IO to re-read data from disk.  Third, more pages were
being written from kswapd context which can adversly affect IO
performance.  Lastly, it was observed that PageDirty pages are not
necessarily dirty on all filesystems (buffers can be clean while
PageDirty is set and ->writepage generates no IO) and not all
filesystems set PageWriteback when the page is being written (e.g.
ext3).  This disconnect confuses the reclaim stalling logic.  This
follow-up series is aimed at these problems.
The tests were based on three kernels
vanilla:	kernel 3.9 as that is what the current mmotm uses as a baseline
mmotm-20130522	is mmotm as of 22nd May with "Reduce system disruption due to
		kswapd" applied on top as per what should be in Andrew's tree
		right now
lessdisrupt-v7r10 is this follow-up series on top of the mmotm kernel
The first test used memcached+memcachetest while some background IO was
in progress as implemented by the parallel IO tests implement in MM
Tests.  memcachetest benchmarks how many operations/second memcached can
service.  It starts with no background IO on a freshly created ext4
filesystem and then re-runs the test with larger amounts of IO in the
background to roughly simulate a large copy in progress.  The
expectation is that the IO should have little or no impact on
memcachetest which is running entirely in memory.
parallelio
                                             3.9.0                       3.9.0                       3.9.0
                                           vanilla          mm1-mmotm-20130522       mm1-lessdisrupt-v7r10
Ops memcachetest-0M             23117.00 (  0.00%)          22780.00 ( -1.46%)          22763.00 ( -1.53%)
Ops memcachetest-715M           23774.00 (  0.00%)          23299.00 ( -2.00%)          22934.00 ( -3.53%)
Ops memcachetest-2385M           4208.00 (  0.00%)          24154.00 (474.00%)          23765.00 (464.76%)
Ops memcachetest-4055M           4104.00 (  0.00%)          25130.00 (512.33%)          24614.00 (499.76%)
Ops io-duration-0M                  0.00 (  0.00%)              0.00 (  0.00%)              0.00 (  0.00%)
Ops io-duration-715M               12.00 (  0.00%)              7.00 ( 41.67%)              6.00 ( 50.00%)
Ops io-duration-2385M             116.00 (  0.00%)             21.00 ( 81.90%)             21.00 ( 81.90%)
Ops io-duration-4055M             160.00 (  0.00%)             36.00 ( 77.50%)             35.00 ( 78.12%)
Ops swaptotal-0M                    0.00 (  0.00%)              0.00 (  0.00%)              0.00 (  0.00%)
Ops swaptotal-715M             140138.00 (  0.00%)             18.00 ( 99.99%)             18.00 ( 99.99%)
Ops swaptotal-2385M            385682.00 (  0.00%)              0.00 (  0.00%)              0.00 (  0.00%)
Ops swaptotal-4055M            418029.00 (  0.00%)              0.00 (  0.00%)              0.00 (  0.00%)
Ops swapin-0M                       0.00 (  0.00%)              0.00 (  0.00%)              0.00 (  0.00%)
Ops swapin-715M                   144.00 (  0.00%)              0.00 (  0.00%)              0.00 (  0.00%)
Ops swapin-2385M               134227.00 (  0.00%)              0.00 (  0.00%)              0.00 (  0.00%)
Ops swapin-4055M               125618.00 (  0.00%)              0.00 (  0.00%)              0.00 (  0.00%)
Ops minorfaults-0M            1536429.00 (  0.00%)        1531632.00 (  0.31%)        1533541.00 (  0.19%)
Ops minorfaults-715M          1786996.00 (  0.00%)        1612148.00 (  9.78%)        1608832.00 (  9.97%)
Ops minorfaults-2385M         1757952.00 (  0.00%)        1614874.00 (  8.14%)        1613541.00 (  8.21%)
Ops minorfaults-4055M         1774460.00 (  0.00%)        1633400.00 (  7.95%)        1630881.00 (  8.09%)
Ops majorfaults-0M                  1.00 (  0.00%)              0.00 (  0.00%)              0.00 (  0.00%)
Ops majorfaults-715M              184.00 (  0.00%)            167.00 (  9.24%)            166.00 (  9.78%)
Ops majorfaults-2385M           24444.00 (  0.00%)            155.00 ( 99.37%)             93.00 ( 99.62%)
Ops majorfaults-4055M           21357.00 (  0.00%)            147.00 ( 99.31%)            134.00 ( 99.37%)
memcachetest is the transactions/second reported by memcachetest. In
        the vanilla kernel note that performance drops from around
        23K/sec to just over 4K/second when there is 2385M of IO going
        on in the background. With current mmotm, there is no collapse
	in performance and with this follow-up series there is little
	change.
swaptotal is the total amount of swap traffic. With mmotm and the follow-up
	series, the total amount of swapping is much reduced.
                                 3.9.0       3.9.0       3.9.0
                               vanillamm1-mmotm-20130522mm1-lessdisrupt-v7r10
Minor Faults                  11160152    10706748    10622316
Major Faults                     46305         755         678
Swap Ins                        260249           0           0
Swap Outs                       683860          18          18
Direct pages scanned                 0         678        2520
Kswapd pages scanned           6046108     8814900     1639279
Kswapd pages reclaimed         1081954     1172267     1094635
Direct pages reclaimed               0         566        2304
Kswapd efficiency                  17%         13%         66%
Kswapd velocity               5217.560    7618.953    1414.879
Direct efficiency                 100%         83%         91%
Direct velocity                  0.000       0.586       2.175
Percentage direct scans             0%          0%          0%
Zone normal velocity          5105.086    6824.681     671.158
Zone dma32 velocity            112.473     794.858     745.896
Zone dma velocity                0.000       0.000       0.000
Page writes by reclaim     1929612.000 6861768.000   32821.000
Page writes file               1245752     6861750       32803
Page writes anon                683860          18          18
Page reclaim immediate            7484          40         239
Sector Reads                   1130320       93996       86900
Sector Writes                 13508052    10823500    11804436
Page rescued immediate               0           0           0
Slabs scanned                    33536       27136       18560
Direct inode steals                  0           0           0
Kswapd inode steals               8641        1035           0
Kswapd skipped wait                  0           0           0
THP fault alloc                      8          37          33
THP collapse alloc                 508         552         515
THP splits                          24           1           1
THP fault fallback                   0           0           0
THP collapse fail                    0           0           0
There are a number of observations to make here
1. Swap outs are almost eliminated. Swap ins are 0 indicating that the
   pages swapped were really unused anonymous pages. Related to that,
   major faults are much reduced.
2. kswapd efficiency was impacted by the initial series but with these
   follow-up patches, the efficiency is now at 66% indicating that far
   fewer pages were skipped during scanning due to dirty or writeback
   pages.
3. kswapd velocity is reduced indicating that fewer pages are being scanned
   with the follow-up series as kswapd now stalls when the tail of the
   LRU queue is full of unqueued dirty pages. The stall gives flushers a
   chance to catch-up so kswapd can reclaim clean pages when it wakes
4. In light of Zlatko's recent reports about zone scanning imbalances,
   mmtests now reports scanning velocity on a per-zone basis. With mainline,
   you can see that the scanning activity is dominated by the Normal
   zone with over 45 times more scanning in Normal than the DMA32 zone.
   With the series currently in mmotm, the ratio is slightly better but it
   is still the case that the bulk of scanning is in the highest zone. With
   this follow-up series, the ratio of scanning between the Normal and
   DMA32 zone is roughly equal.
5. As Dave Chinner observed, the current patches in mmotm increased the
   number of pages written from kswapd context which is expected to adversly
   impact IO performance. With the follow-up patches, far fewer pages are
   written from kswapd context than the mainline kernel
6. With the series in mmotm, fewer inodes were reclaimed by kswapd. With
   the follow-up series, there is less slab shrinking activity and no inodes
   were reclaimed.
7. Note that "Sectors Read" is drastically reduced implying that the source
   data being used for the IO is not being aggressively discarded due to
   page reclaim skipping over dirty pages and reclaiming clean pages. Note
   that the reducion in reads could also be due to inode data not being
   re-read from disk after a slab shrink.
                       3.9.0       3.9.0       3.9.0
                     vanillamm1-mmotm-20130522mm1-lessdisrupt-v7r10
Mean sda-avgqz        166.99       32.09       33.44
Mean sda-await        853.64      192.76      185.43
Mean sda-r_await        6.31        9.24        5.97
Mean sda-w_await     2992.81      202.65      192.43
Max  sda-avgqz       1409.91      718.75      698.98
Max  sda-await       6665.74     3538.00     3124.23
Max  sda-r_await       58.96      111.95       58.00
Max  sda-w_await    28458.94     3977.29     3148.61
In light of the changes in writes from reclaim context, the number of
reads and Dave Chinner's concerns about IO performance I took a closer
look at the IO stats for the test disk. Few observations
1. The average queue size is reduced by the initial series and roughly
   the same with this follow up.
2. Average wait times for writes are reduced and as the IO
   is completing faster it at least implies that the gain is because
   flushers are writing the files efficiently instead of page reclaim
   getting in the way.
3. The reduction in maximum write latency is staggering. 28 seconds down
   to 3 seconds.
Jan Kara asked how NFS is affected by all of this. Unstable pages can
be taken into account as one of the patches in the series shows but it
is still the case that filesystems with unusual handling of dirty or
writeback could still be treated better.
Tests like postmark, fsmark and largedd showed up nothing useful. On my test
setup, pages are simply not being written back from reclaim context with or
without the patches and there are no changes in performance. My test setup
probably is just not strong enough network-wise to be really interesting.
I ran a longer-lived memcached test with IO going to NFS instead of a local disk
parallelio
                                             3.9.0                       3.9.0                       3.9.0
                                           vanilla          mm1-mmotm-20130522       mm1-lessdisrupt-v7r10
Ops memcachetest-0M             23323.00 (  0.00%)          23241.00 ( -0.35%)          23321.00 ( -0.01%)
Ops memcachetest-715M           25526.00 (  0.00%)          24763.00 ( -2.99%)          23242.00 ( -8.95%)
Ops memcachetest-2385M           8814.00 (  0.00%)          26924.00 (205.47%)          23521.00 (166.86%)
Ops memcachetest-4055M           5835.00 (  0.00%)          26827.00 (359.76%)          25560.00 (338.05%)
Ops io-duration-0M                  0.00 (  0.00%)              0.00 (  0.00%)              0.00 (  0.00%)
Ops io-duration-715M               65.00 (  0.00%)             71.00 ( -9.23%)             11.00 ( 83.08%)
Ops io-duration-2385M             129.00 (  0.00%)             94.00 ( 27.13%)             53.00 ( 58.91%)
Ops io-duration-4055M             301.00 (  0.00%)            100.00 ( 66.78%)            108.00 ( 64.12%)
Ops swaptotal-0M                    0.00 (  0.00%)              0.00 (  0.00%)              0.00 (  0.00%)
Ops swaptotal-715M              14394.00 (  0.00%)            949.00 ( 93.41%)             63.00 ( 99.56%)
Ops swaptotal-2385M            401483.00 (  0.00%)          24437.00 ( 93.91%)          30118.00 ( 92.50%)
Ops swaptotal-4055M            554123.00 (  0.00%)          35688.00 ( 93.56%)          63082.00 ( 88.62%)
Ops swapin-0M                       0.00 (  0.00%)              0.00 (  0.00%)              0.00 (  0.00%)
Ops swapin-715M                  4522.00 (  0.00%)            560.00 ( 87.62%)             63.00 ( 98.61%)
Ops swapin-2385M               169861.00 (  0.00%)           5026.00 ( 97.04%)          13917.00 ( 91.81%)
Ops swapin-4055M               192374.00 (  0.00%)          10056.00 ( 94.77%)          25729.00 ( 86.63%)
Ops minorfaults-0M            1445969.00 (  0.00%)        1520878.00 ( -5.18%)        1454024.00 ( -0.56%)
Ops minorfaults-715M          1557288.00 (  0.00%)        1528482.00 (  1.85%)        1535776.00 (  1.38%)
Ops minorfaults-2385M         1692896.00 (  0.00%)        1570523.00 (  7.23%)        1559622.00 (  7.87%)
Ops minorfaults-4055M         1654985.00 (  0.00%)        1581456.00 (  4.44%)        1596713.00 (  3.52%)
Ops majorfaults-0M                  0.00 (  0.00%)              1.00 (-99.00%)              0.00 (  0.00%)
Ops majorfaults-715M              763.00 (  0.00%)            265.00 ( 65.27%)             75.00 ( 90.17%)
Ops majorfaults-2385M           23861.00 (  0.00%)            894.00 ( 96.25%)           2189.00 ( 90.83%)
Ops majorfaults-4055M           27210.00 (  0.00%)           1569.00 ( 94.23%)           4088.00 ( 84.98%)
1. Performance does not collapse due to IO which is good. IO is also completing
   faster. Note with mmotm, IO completes in a third of the time and faster again
   with this series applied
2. Swapping is reduced, although not eliminated. The figures for the follow-up
   look bad but it does vary a bit as the stalling is not perfect for nfs
   or filesystems like ext3 with unusual handling of dirty and writeback
   pages
3. There are swapins, particularly with larger amounts of IO indicating
   that active pages are being reclaimed. However, the number of much
   reduced.
                                 3.9.0       3.9.0       3.9.0
                               vanillamm1-mmotm-20130522mm1-lessdisrupt-v7r10
Minor Faults                  36339175    35025445    35219699
Major Faults                    310964       27108       51887
Swap Ins                       2176399      173069      333316
Swap Outs                      3344050      357228      504824
Direct pages scanned              8972       77283       43242
Kswapd pages scanned          20899983     8939566    14772851
Kswapd pages reclaimed         6193156     5172605     5231026
Direct pages reclaimed            8450       73802       39514
Kswapd efficiency                  29%         57%         35%
Kswapd velocity               3929.743    1847.499    3058.840
Direct efficiency                  94%         95%         91%
Direct velocity                  1.687      15.972       8.954
Percentage direct scans             0%          0%          0%
Zone normal velocity          3721.907     939.103    2185.142
Zone dma32 velocity            209.522     924.368     882.651
Zone dma velocity                0.000       0.000       0.000
Page writes by reclaim     4082185.000  526319.000  537114.000
Page writes file                738135      169091       32290
Page writes anon               3344050      357228      504824
Page reclaim immediate            9524         170     5595843
Sector Reads                   8909900      861192     1483680
Sector Writes                 13428980     1488744     2076800
Page rescued immediate               0           0           0
Slabs scanned                    38016       31744       28672
Direct inode steals                  0           0           0
Kswapd inode steals                424           0           0
Kswapd skipped wait                  0           0           0
THP fault alloc                     14          15         119
THP collapse alloc                1767        1569        1618
THP splits                          30          29          25
THP fault fallback                   0           0           0
THP collapse fail                    8           5           0
Compaction stalls                   17          41         100
Compaction success                   7          31          95
Compaction failures                 10          10           5
Page migrate success              7083       22157       62217
Page migrate failure                 0           0           0
Compaction pages isolated        14847       48758      135830
Compaction migrate scanned       18328       48398      138929
Compaction free scanned        2000255      355827     1720269
Compaction cost                      7          24          68
I guess the main takeaway again is the much reduced page writes
from reclaim context and reduced reads.
                       3.9.0       3.9.0       3.9.0
                     vanillamm1-mmotm-20130522mm1-lessdisrupt-v7r10
Mean sda-avgqz         23.58        0.35        0.44
Mean sda-await        133.47       15.72       15.46
Mean sda-r_await        4.72        4.69        3.95
Mean sda-w_await      507.69       28.40       33.68
Max  sda-avgqz        680.60       12.25       23.14
Max  sda-await       3958.89      221.83      286.22
Max  sda-r_await       63.86       61.23       67.29
Max  sda-w_await    11710.38      883.57     1767.28
And as before, write wait times are much reduced.
This patch:
The patch "mm: vmscan: Have kswapd writeback pages based on dirty pages
encountered, not priority" decides whether to writeback pages from reclaim
context based on the number of dirty pages encountered.  This situation is
flagged too easily and flushers are not given the chance to catch up
resulting in more pages being written from reclaim context and potentially
impacting IO performance.  The check for PageWriteback is also misplaced
as it happens within a PageDirty check which is nonsense as the dirty may
have been cleared for IO.  The accounting is updated very late and pages
that are already under writeback, were reactivated, could not unmapped or
could not be released are all missed.  Similarly, a page is considered
congested for reasons other than being congested and pages that cannot be
written out in the correct context are skipped.  Finally, it considers
stalling and writing back filesystem pages due to encountering dirty
anonymous pages at the tail of the LRU which is dumb.
This patch causes kswapd to begin writing filesystem pages from reclaim
context only if page reclaim found that all filesystem pages at the tail
of the LRU were unqueued dirty pages.  Before it starts writing filesystem
pages, it will stall to give flushers a chance to catch up.  The decision
on whether wait_iff_congested is also now determined by dirty filesystem
pages only.  Congested pages are based on whether the underlying BDI is
congested regardless of the context of the reclaiming process.
Signed-off-by: Mel Gorman <mgorman@suse.de>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Michal Hocko <mhocko@suse.cz>
Cc: Rik van Riel <riel@redhat.com>
Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Jiri Slaby <jslaby@suse.cz>
Cc: Valdis Kletnieks <Valdis.Kletnieks@vt.edu>
Cc: Zlatko Calusic <zcalusic@bitsync.net>
Cc: dormando <dormando@rydia.net>
Cc: Trond Myklebust <trond.myklebust@fys.uio.no>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
											
										 
											2013-07-03 15:01:57 -07:00
										 |  |  | 	/*
 | 
					
						
							|  |  |  | 	 * Anonymous pages are not handled by flushers and must be written | 
					
						
							|  |  |  | 	 * from reclaim context. Do not stall reclaim based on them | 
					
						
							|  |  |  | 	 */ | 
					
						
							|  |  |  | 	if (!page_is_file_cache(page)) { | 
					
						
							|  |  |  | 		*dirty = false; | 
					
						
							|  |  |  | 		*writeback = false; | 
					
						
							|  |  |  | 		return; | 
					
						
							|  |  |  | 	} | 
					
						
							|  |  |  | 
 | 
					
						
							|  |  |  | 	/* By default assume that the page flags are accurate */ | 
					
						
							|  |  |  | 	*dirty = PageDirty(page); | 
					
						
							|  |  |  | 	*writeback = PageWriteback(page); | 
					
						
							| 
									
										
										
										
											2013-07-03 15:02:05 -07:00
										 |  |  | 
 | 
					
						
							|  |  |  | 	/* Verify dirty/writeback state if the filesystem supports it */ | 
					
						
							|  |  |  | 	if (!page_has_private(page)) | 
					
						
							|  |  |  | 		return; | 
					
						
							|  |  |  | 
 | 
					
						
							|  |  |  | 	mapping = page_mapping(page); | 
					
						
							|  |  |  | 	if (mapping && mapping->a_ops->is_dirty_writeback) | 
					
						
							|  |  |  | 		mapping->a_ops->is_dirty_writeback(page, dirty, writeback); | 
					
						
							| 
									
										
											  
											
												mm: vmscan: stall page reclaim and writeback pages based on dirty/writepage pages encountered
Further testing of the "Reduce system disruption due to kswapd"
discovered a few problems.  First and foremost, it's possible for pages
under writeback to be freed which will lead to badness.  Second, as
pages were not being swapped the file LRU was being scanned faster and
clean file pages were being reclaimed.  In some cases this results in
increased read IO to re-read data from disk.  Third, more pages were
being written from kswapd context which can adversly affect IO
performance.  Lastly, it was observed that PageDirty pages are not
necessarily dirty on all filesystems (buffers can be clean while
PageDirty is set and ->writepage generates no IO) and not all
filesystems set PageWriteback when the page is being written (e.g.
ext3).  This disconnect confuses the reclaim stalling logic.  This
follow-up series is aimed at these problems.
The tests were based on three kernels
vanilla:	kernel 3.9 as that is what the current mmotm uses as a baseline
mmotm-20130522	is mmotm as of 22nd May with "Reduce system disruption due to
		kswapd" applied on top as per what should be in Andrew's tree
		right now
lessdisrupt-v7r10 is this follow-up series on top of the mmotm kernel
The first test used memcached+memcachetest while some background IO was
in progress as implemented by the parallel IO tests implement in MM
Tests.  memcachetest benchmarks how many operations/second memcached can
service.  It starts with no background IO on a freshly created ext4
filesystem and then re-runs the test with larger amounts of IO in the
background to roughly simulate a large copy in progress.  The
expectation is that the IO should have little or no impact on
memcachetest which is running entirely in memory.
parallelio
                                             3.9.0                       3.9.0                       3.9.0
                                           vanilla          mm1-mmotm-20130522       mm1-lessdisrupt-v7r10
Ops memcachetest-0M             23117.00 (  0.00%)          22780.00 ( -1.46%)          22763.00 ( -1.53%)
Ops memcachetest-715M           23774.00 (  0.00%)          23299.00 ( -2.00%)          22934.00 ( -3.53%)
Ops memcachetest-2385M           4208.00 (  0.00%)          24154.00 (474.00%)          23765.00 (464.76%)
Ops memcachetest-4055M           4104.00 (  0.00%)          25130.00 (512.33%)          24614.00 (499.76%)
Ops io-duration-0M                  0.00 (  0.00%)              0.00 (  0.00%)              0.00 (  0.00%)
Ops io-duration-715M               12.00 (  0.00%)              7.00 ( 41.67%)              6.00 ( 50.00%)
Ops io-duration-2385M             116.00 (  0.00%)             21.00 ( 81.90%)             21.00 ( 81.90%)
Ops io-duration-4055M             160.00 (  0.00%)             36.00 ( 77.50%)             35.00 ( 78.12%)
Ops swaptotal-0M                    0.00 (  0.00%)              0.00 (  0.00%)              0.00 (  0.00%)
Ops swaptotal-715M             140138.00 (  0.00%)             18.00 ( 99.99%)             18.00 ( 99.99%)
Ops swaptotal-2385M            385682.00 (  0.00%)              0.00 (  0.00%)              0.00 (  0.00%)
Ops swaptotal-4055M            418029.00 (  0.00%)              0.00 (  0.00%)              0.00 (  0.00%)
Ops swapin-0M                       0.00 (  0.00%)              0.00 (  0.00%)              0.00 (  0.00%)
Ops swapin-715M                   144.00 (  0.00%)              0.00 (  0.00%)              0.00 (  0.00%)
Ops swapin-2385M               134227.00 (  0.00%)              0.00 (  0.00%)              0.00 (  0.00%)
Ops swapin-4055M               125618.00 (  0.00%)              0.00 (  0.00%)              0.00 (  0.00%)
Ops minorfaults-0M            1536429.00 (  0.00%)        1531632.00 (  0.31%)        1533541.00 (  0.19%)
Ops minorfaults-715M          1786996.00 (  0.00%)        1612148.00 (  9.78%)        1608832.00 (  9.97%)
Ops minorfaults-2385M         1757952.00 (  0.00%)        1614874.00 (  8.14%)        1613541.00 (  8.21%)
Ops minorfaults-4055M         1774460.00 (  0.00%)        1633400.00 (  7.95%)        1630881.00 (  8.09%)
Ops majorfaults-0M                  1.00 (  0.00%)              0.00 (  0.00%)              0.00 (  0.00%)
Ops majorfaults-715M              184.00 (  0.00%)            167.00 (  9.24%)            166.00 (  9.78%)
Ops majorfaults-2385M           24444.00 (  0.00%)            155.00 ( 99.37%)             93.00 ( 99.62%)
Ops majorfaults-4055M           21357.00 (  0.00%)            147.00 ( 99.31%)            134.00 ( 99.37%)
memcachetest is the transactions/second reported by memcachetest. In
        the vanilla kernel note that performance drops from around
        23K/sec to just over 4K/second when there is 2385M of IO going
        on in the background. With current mmotm, there is no collapse
	in performance and with this follow-up series there is little
	change.
swaptotal is the total amount of swap traffic. With mmotm and the follow-up
	series, the total amount of swapping is much reduced.
                                 3.9.0       3.9.0       3.9.0
                               vanillamm1-mmotm-20130522mm1-lessdisrupt-v7r10
Minor Faults                  11160152    10706748    10622316
Major Faults                     46305         755         678
Swap Ins                        260249           0           0
Swap Outs                       683860          18          18
Direct pages scanned                 0         678        2520
Kswapd pages scanned           6046108     8814900     1639279
Kswapd pages reclaimed         1081954     1172267     1094635
Direct pages reclaimed               0         566        2304
Kswapd efficiency                  17%         13%         66%
Kswapd velocity               5217.560    7618.953    1414.879
Direct efficiency                 100%         83%         91%
Direct velocity                  0.000       0.586       2.175
Percentage direct scans             0%          0%          0%
Zone normal velocity          5105.086    6824.681     671.158
Zone dma32 velocity            112.473     794.858     745.896
Zone dma velocity                0.000       0.000       0.000
Page writes by reclaim     1929612.000 6861768.000   32821.000
Page writes file               1245752     6861750       32803
Page writes anon                683860          18          18
Page reclaim immediate            7484          40         239
Sector Reads                   1130320       93996       86900
Sector Writes                 13508052    10823500    11804436
Page rescued immediate               0           0           0
Slabs scanned                    33536       27136       18560
Direct inode steals                  0           0           0
Kswapd inode steals               8641        1035           0
Kswapd skipped wait                  0           0           0
THP fault alloc                      8          37          33
THP collapse alloc                 508         552         515
THP splits                          24           1           1
THP fault fallback                   0           0           0
THP collapse fail                    0           0           0
There are a number of observations to make here
1. Swap outs are almost eliminated. Swap ins are 0 indicating that the
   pages swapped were really unused anonymous pages. Related to that,
   major faults are much reduced.
2. kswapd efficiency was impacted by the initial series but with these
   follow-up patches, the efficiency is now at 66% indicating that far
   fewer pages were skipped during scanning due to dirty or writeback
   pages.
3. kswapd velocity is reduced indicating that fewer pages are being scanned
   with the follow-up series as kswapd now stalls when the tail of the
   LRU queue is full of unqueued dirty pages. The stall gives flushers a
   chance to catch-up so kswapd can reclaim clean pages when it wakes
4. In light of Zlatko's recent reports about zone scanning imbalances,
   mmtests now reports scanning velocity on a per-zone basis. With mainline,
   you can see that the scanning activity is dominated by the Normal
   zone with over 45 times more scanning in Normal than the DMA32 zone.
   With the series currently in mmotm, the ratio is slightly better but it
   is still the case that the bulk of scanning is in the highest zone. With
   this follow-up series, the ratio of scanning between the Normal and
   DMA32 zone is roughly equal.
5. As Dave Chinner observed, the current patches in mmotm increased the
   number of pages written from kswapd context which is expected to adversly
   impact IO performance. With the follow-up patches, far fewer pages are
   written from kswapd context than the mainline kernel
6. With the series in mmotm, fewer inodes were reclaimed by kswapd. With
   the follow-up series, there is less slab shrinking activity and no inodes
   were reclaimed.
7. Note that "Sectors Read" is drastically reduced implying that the source
   data being used for the IO is not being aggressively discarded due to
   page reclaim skipping over dirty pages and reclaiming clean pages. Note
   that the reducion in reads could also be due to inode data not being
   re-read from disk after a slab shrink.
                       3.9.0       3.9.0       3.9.0
                     vanillamm1-mmotm-20130522mm1-lessdisrupt-v7r10
Mean sda-avgqz        166.99       32.09       33.44
Mean sda-await        853.64      192.76      185.43
Mean sda-r_await        6.31        9.24        5.97
Mean sda-w_await     2992.81      202.65      192.43
Max  sda-avgqz       1409.91      718.75      698.98
Max  sda-await       6665.74     3538.00     3124.23
Max  sda-r_await       58.96      111.95       58.00
Max  sda-w_await    28458.94     3977.29     3148.61
In light of the changes in writes from reclaim context, the number of
reads and Dave Chinner's concerns about IO performance I took a closer
look at the IO stats for the test disk. Few observations
1. The average queue size is reduced by the initial series and roughly
   the same with this follow up.
2. Average wait times for writes are reduced and as the IO
   is completing faster it at least implies that the gain is because
   flushers are writing the files efficiently instead of page reclaim
   getting in the way.
3. The reduction in maximum write latency is staggering. 28 seconds down
   to 3 seconds.
Jan Kara asked how NFS is affected by all of this. Unstable pages can
be taken into account as one of the patches in the series shows but it
is still the case that filesystems with unusual handling of dirty or
writeback could still be treated better.
Tests like postmark, fsmark and largedd showed up nothing useful. On my test
setup, pages are simply not being written back from reclaim context with or
without the patches and there are no changes in performance. My test setup
probably is just not strong enough network-wise to be really interesting.
I ran a longer-lived memcached test with IO going to NFS instead of a local disk
parallelio
                                             3.9.0                       3.9.0                       3.9.0
                                           vanilla          mm1-mmotm-20130522       mm1-lessdisrupt-v7r10
Ops memcachetest-0M             23323.00 (  0.00%)          23241.00 ( -0.35%)          23321.00 ( -0.01%)
Ops memcachetest-715M           25526.00 (  0.00%)          24763.00 ( -2.99%)          23242.00 ( -8.95%)
Ops memcachetest-2385M           8814.00 (  0.00%)          26924.00 (205.47%)          23521.00 (166.86%)
Ops memcachetest-4055M           5835.00 (  0.00%)          26827.00 (359.76%)          25560.00 (338.05%)
Ops io-duration-0M                  0.00 (  0.00%)              0.00 (  0.00%)              0.00 (  0.00%)
Ops io-duration-715M               65.00 (  0.00%)             71.00 ( -9.23%)             11.00 ( 83.08%)
Ops io-duration-2385M             129.00 (  0.00%)             94.00 ( 27.13%)             53.00 ( 58.91%)
Ops io-duration-4055M             301.00 (  0.00%)            100.00 ( 66.78%)            108.00 ( 64.12%)
Ops swaptotal-0M                    0.00 (  0.00%)              0.00 (  0.00%)              0.00 (  0.00%)
Ops swaptotal-715M              14394.00 (  0.00%)            949.00 ( 93.41%)             63.00 ( 99.56%)
Ops swaptotal-2385M            401483.00 (  0.00%)          24437.00 ( 93.91%)          30118.00 ( 92.50%)
Ops swaptotal-4055M            554123.00 (  0.00%)          35688.00 ( 93.56%)          63082.00 ( 88.62%)
Ops swapin-0M                       0.00 (  0.00%)              0.00 (  0.00%)              0.00 (  0.00%)
Ops swapin-715M                  4522.00 (  0.00%)            560.00 ( 87.62%)             63.00 ( 98.61%)
Ops swapin-2385M               169861.00 (  0.00%)           5026.00 ( 97.04%)          13917.00 ( 91.81%)
Ops swapin-4055M               192374.00 (  0.00%)          10056.00 ( 94.77%)          25729.00 ( 86.63%)
Ops minorfaults-0M            1445969.00 (  0.00%)        1520878.00 ( -5.18%)        1454024.00 ( -0.56%)
Ops minorfaults-715M          1557288.00 (  0.00%)        1528482.00 (  1.85%)        1535776.00 (  1.38%)
Ops minorfaults-2385M         1692896.00 (  0.00%)        1570523.00 (  7.23%)        1559622.00 (  7.87%)
Ops minorfaults-4055M         1654985.00 (  0.00%)        1581456.00 (  4.44%)        1596713.00 (  3.52%)
Ops majorfaults-0M                  0.00 (  0.00%)              1.00 (-99.00%)              0.00 (  0.00%)
Ops majorfaults-715M              763.00 (  0.00%)            265.00 ( 65.27%)             75.00 ( 90.17%)
Ops majorfaults-2385M           23861.00 (  0.00%)            894.00 ( 96.25%)           2189.00 ( 90.83%)
Ops majorfaults-4055M           27210.00 (  0.00%)           1569.00 ( 94.23%)           4088.00 ( 84.98%)
1. Performance does not collapse due to IO which is good. IO is also completing
   faster. Note with mmotm, IO completes in a third of the time and faster again
   with this series applied
2. Swapping is reduced, although not eliminated. The figures for the follow-up
   look bad but it does vary a bit as the stalling is not perfect for nfs
   or filesystems like ext3 with unusual handling of dirty and writeback
   pages
3. There are swapins, particularly with larger amounts of IO indicating
   that active pages are being reclaimed. However, the number of much
   reduced.
                                 3.9.0       3.9.0       3.9.0
                               vanillamm1-mmotm-20130522mm1-lessdisrupt-v7r10
Minor Faults                  36339175    35025445    35219699
Major Faults                    310964       27108       51887
Swap Ins                       2176399      173069      333316
Swap Outs                      3344050      357228      504824
Direct pages scanned              8972       77283       43242
Kswapd pages scanned          20899983     8939566    14772851
Kswapd pages reclaimed         6193156     5172605     5231026
Direct pages reclaimed            8450       73802       39514
Kswapd efficiency                  29%         57%         35%
Kswapd velocity               3929.743    1847.499    3058.840
Direct efficiency                  94%         95%         91%
Direct velocity                  1.687      15.972       8.954
Percentage direct scans             0%          0%          0%
Zone normal velocity          3721.907     939.103    2185.142
Zone dma32 velocity            209.522     924.368     882.651
Zone dma velocity                0.000       0.000       0.000
Page writes by reclaim     4082185.000  526319.000  537114.000
Page writes file                738135      169091       32290
Page writes anon               3344050      357228      504824
Page reclaim immediate            9524         170     5595843
Sector Reads                   8909900      861192     1483680
Sector Writes                 13428980     1488744     2076800
Page rescued immediate               0           0           0
Slabs scanned                    38016       31744       28672
Direct inode steals                  0           0           0
Kswapd inode steals                424           0           0
Kswapd skipped wait                  0           0           0
THP fault alloc                     14          15         119
THP collapse alloc                1767        1569        1618
THP splits                          30          29          25
THP fault fallback                   0           0           0
THP collapse fail                    8           5           0
Compaction stalls                   17          41         100
Compaction success                   7          31          95
Compaction failures                 10          10           5
Page migrate success              7083       22157       62217
Page migrate failure                 0           0           0
Compaction pages isolated        14847       48758      135830
Compaction migrate scanned       18328       48398      138929
Compaction free scanned        2000255      355827     1720269
Compaction cost                      7          24          68
I guess the main takeaway again is the much reduced page writes
from reclaim context and reduced reads.
                       3.9.0       3.9.0       3.9.0
                     vanillamm1-mmotm-20130522mm1-lessdisrupt-v7r10
Mean sda-avgqz         23.58        0.35        0.44
Mean sda-await        133.47       15.72       15.46
Mean sda-r_await        4.72        4.69        3.95
Mean sda-w_await      507.69       28.40       33.68
Max  sda-avgqz        680.60       12.25       23.14
Max  sda-await       3958.89      221.83      286.22
Max  sda-r_await       63.86       61.23       67.29
Max  sda-w_await    11710.38      883.57     1767.28
And as before, write wait times are much reduced.
This patch:
The patch "mm: vmscan: Have kswapd writeback pages based on dirty pages
encountered, not priority" decides whether to writeback pages from reclaim
context based on the number of dirty pages encountered.  This situation is
flagged too easily and flushers are not given the chance to catch up
resulting in more pages being written from reclaim context and potentially
impacting IO performance.  The check for PageWriteback is also misplaced
as it happens within a PageDirty check which is nonsense as the dirty may
have been cleared for IO.  The accounting is updated very late and pages
that are already under writeback, were reactivated, could not unmapped or
could not be released are all missed.  Similarly, a page is considered
congested for reasons other than being congested and pages that cannot be
written out in the correct context are skipped.  Finally, it considers
stalling and writing back filesystem pages due to encountering dirty
anonymous pages at the tail of the LRU which is dumb.
This patch causes kswapd to begin writing filesystem pages from reclaim
context only if page reclaim found that all filesystem pages at the tail
of the LRU were unqueued dirty pages.  Before it starts writing filesystem
pages, it will stall to give flushers a chance to catch up.  The decision
on whether wait_iff_congested is also now determined by dirty filesystem
pages only.  Congested pages are based on whether the underlying BDI is
congested regardless of the context of the reclaiming process.
Signed-off-by: Mel Gorman <mgorman@suse.de>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Michal Hocko <mhocko@suse.cz>
Cc: Rik van Riel <riel@redhat.com>
Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Jiri Slaby <jslaby@suse.cz>
Cc: Valdis Kletnieks <Valdis.Kletnieks@vt.edu>
Cc: Zlatko Calusic <zcalusic@bitsync.net>
Cc: dormando <dormando@rydia.net>
Cc: Trond Myklebust <trond.myklebust@fys.uio.no>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
											
										 
											2013-07-03 15:01:57 -07:00
										 |  |  | } | 
					
						
							|  |  |  | 
 | 
					
						
							| 
									
										
										
										
											2005-04-16 15:20:36 -07:00
										 |  |  | /*
 | 
					
						
							| 
									
										
										
											
												[PATCH] vmscan: rename functions
We have:
	try_to_free_pages
	->shrink_caches(struct zone **zones, ..)
	  ->shrink_zone(struct zone *, ...)
	    ->shrink_cache(struct zone *, ...)
	      ->shrink_list(struct list_head *, ...)
	    ->refill_inactive_list((struct zone *, ...)
which is fairly irrational.
Rename things so that we have
 	try_to_free_pages
 	->shrink_zones(struct zone **zones, ..)
 	  ->shrink_zone(struct zone *, ...)
 	    ->shrink_inactive_list(struct zone *, ...)
 	      ->shrink_page_list(struct list_head *, ...)
	    ->shrink_active_list(struct zone *, ...)
Cc: Nick Piggin <nickpiggin@yahoo.com.au>
Cc: Christoph Lameter <christoph@lameter.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
											
										 
											2006-03-22 00:08:21 -08:00
										 |  |  |  * shrink_page_list() returns the number of reclaimed pages | 
					
						
							| 
									
										
										
										
											2005-04-16 15:20:36 -07:00
										 |  |  |  */ | 
					
						
							| 
									
										
										
											
												[PATCH] vmscan: rename functions
We have:
	try_to_free_pages
	->shrink_caches(struct zone **zones, ..)
	  ->shrink_zone(struct zone *, ...)
	    ->shrink_cache(struct zone *, ...)
	      ->shrink_list(struct list_head *, ...)
	    ->refill_inactive_list((struct zone *, ...)
which is fairly irrational.
Rename things so that we have
 	try_to_free_pages
 	->shrink_zones(struct zone **zones, ..)
 	  ->shrink_zone(struct zone *, ...)
 	    ->shrink_inactive_list(struct zone *, ...)
 	      ->shrink_page_list(struct list_head *, ...)
	    ->shrink_active_list(struct zone *, ...)
Cc: Nick Piggin <nickpiggin@yahoo.com.au>
Cc: Christoph Lameter <christoph@lameter.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
											
										 
											2006-03-22 00:08:21 -08:00
										 |  |  | static unsigned long shrink_page_list(struct list_head *page_list, | 
					
						
							| 
									
										
										
										
											2012-05-29 15:06:59 -07:00
										 |  |  | 				      struct zone *zone, | 
					
						
							| 
									
										
										
										
											2011-10-31 17:07:51 -07:00
										 |  |  | 				      struct scan_control *sc, | 
					
						
							| 
									
										
										
										
											2012-10-08 16:31:55 -07:00
										 |  |  | 				      enum ttu_flags ttu_flags, | 
					
						
							| 
									
										
										
										
											2013-07-03 15:02:02 -07:00
										 |  |  | 				      unsigned long *ret_nr_dirty, | 
					
						
							| 
									
										
										
										
											2013-07-03 15:01:50 -07:00
										 |  |  | 				      unsigned long *ret_nr_unqueued_dirty, | 
					
						
							| 
									
										
										
										
											2013-07-03 15:02:02 -07:00
										 |  |  | 				      unsigned long *ret_nr_congested, | 
					
						
							| 
									
										
										
										
											2012-10-08 16:31:55 -07:00
										 |  |  | 				      unsigned long *ret_nr_writeback, | 
					
						
							| 
									
										
										
										
											2013-07-03 15:01:58 -07:00
										 |  |  | 				      unsigned long *ret_nr_immediate, | 
					
						
							| 
									
										
										
										
											2012-10-08 16:31:55 -07:00
										 |  |  | 				      bool force_reclaim) | 
					
						
							| 
									
										
										
										
											2005-04-16 15:20:36 -07:00
										 |  |  | { | 
					
						
							|  |  |  | 	LIST_HEAD(ret_pages); | 
					
						
							| 
									
										
										
										
											2010-08-09 17:19:31 -07:00
										 |  |  | 	LIST_HEAD(free_pages); | 
					
						
							| 
									
										
										
										
											2005-04-16 15:20:36 -07:00
										 |  |  | 	int pgactivate = 0; | 
					
						
							| 
									
										
										
										
											2013-07-03 15:01:50 -07:00
										 |  |  | 	unsigned long nr_unqueued_dirty = 0; | 
					
						
							| 
									
										
										
										
											2010-10-26 14:21:45 -07:00
										 |  |  | 	unsigned long nr_dirty = 0; | 
					
						
							|  |  |  | 	unsigned long nr_congested = 0; | 
					
						
							| 
									
										
										
										
											2006-03-22 00:08:20 -08:00
										 |  |  | 	unsigned long nr_reclaimed = 0; | 
					
						
							| 
									
										
										
										
											2011-10-31 17:07:56 -07:00
										 |  |  | 	unsigned long nr_writeback = 0; | 
					
						
							| 
									
										
										
										
											2013-07-03 15:01:58 -07:00
										 |  |  | 	unsigned long nr_immediate = 0; | 
					
						
							| 
									
										
										
										
											2005-04-16 15:20:36 -07:00
										 |  |  | 
 | 
					
						
							|  |  |  | 	cond_resched(); | 
					
						
							|  |  |  | 
 | 
					
						
							| 
									
										
										
										
											2012-07-31 16:46:08 -07:00
										 |  |  | 	mem_cgroup_uncharge_start(); | 
					
						
							| 
									
										
										
										
											2005-04-16 15:20:36 -07:00
										 |  |  | 	while (!list_empty(page_list)) { | 
					
						
							|  |  |  | 		struct address_space *mapping; | 
					
						
							|  |  |  | 		struct page *page; | 
					
						
							|  |  |  | 		int may_enter_fs; | 
					
						
							| 
									
										
										
										
											2012-10-08 16:31:55 -07:00
										 |  |  | 		enum page_references references = PAGEREF_RECLAIM_CLEAN; | 
					
						
							| 
									
										
											  
											
												mm: vmscan: stall page reclaim and writeback pages based on dirty/writepage pages encountered
Further testing of the "Reduce system disruption due to kswapd"
discovered a few problems.  First and foremost, it's possible for pages
under writeback to be freed which will lead to badness.  Second, as
pages were not being swapped the file LRU was being scanned faster and
clean file pages were being reclaimed.  In some cases this results in
increased read IO to re-read data from disk.  Third, more pages were
being written from kswapd context which can adversly affect IO
performance.  Lastly, it was observed that PageDirty pages are not
necessarily dirty on all filesystems (buffers can be clean while
PageDirty is set and ->writepage generates no IO) and not all
filesystems set PageWriteback when the page is being written (e.g.
ext3).  This disconnect confuses the reclaim stalling logic.  This
follow-up series is aimed at these problems.
The tests were based on three kernels
vanilla:	kernel 3.9 as that is what the current mmotm uses as a baseline
mmotm-20130522	is mmotm as of 22nd May with "Reduce system disruption due to
		kswapd" applied on top as per what should be in Andrew's tree
		right now
lessdisrupt-v7r10 is this follow-up series on top of the mmotm kernel
The first test used memcached+memcachetest while some background IO was
in progress as implemented by the parallel IO tests implement in MM
Tests.  memcachetest benchmarks how many operations/second memcached can
service.  It starts with no background IO on a freshly created ext4
filesystem and then re-runs the test with larger amounts of IO in the
background to roughly simulate a large copy in progress.  The
expectation is that the IO should have little or no impact on
memcachetest which is running entirely in memory.
parallelio
                                             3.9.0                       3.9.0                       3.9.0
                                           vanilla          mm1-mmotm-20130522       mm1-lessdisrupt-v7r10
Ops memcachetest-0M             23117.00 (  0.00%)          22780.00 ( -1.46%)          22763.00 ( -1.53%)
Ops memcachetest-715M           23774.00 (  0.00%)          23299.00 ( -2.00%)          22934.00 ( -3.53%)
Ops memcachetest-2385M           4208.00 (  0.00%)          24154.00 (474.00%)          23765.00 (464.76%)
Ops memcachetest-4055M           4104.00 (  0.00%)          25130.00 (512.33%)          24614.00 (499.76%)
Ops io-duration-0M                  0.00 (  0.00%)              0.00 (  0.00%)              0.00 (  0.00%)
Ops io-duration-715M               12.00 (  0.00%)              7.00 ( 41.67%)              6.00 ( 50.00%)
Ops io-duration-2385M             116.00 (  0.00%)             21.00 ( 81.90%)             21.00 ( 81.90%)
Ops io-duration-4055M             160.00 (  0.00%)             36.00 ( 77.50%)             35.00 ( 78.12%)
Ops swaptotal-0M                    0.00 (  0.00%)              0.00 (  0.00%)              0.00 (  0.00%)
Ops swaptotal-715M             140138.00 (  0.00%)             18.00 ( 99.99%)             18.00 ( 99.99%)
Ops swaptotal-2385M            385682.00 (  0.00%)              0.00 (  0.00%)              0.00 (  0.00%)
Ops swaptotal-4055M            418029.00 (  0.00%)              0.00 (  0.00%)              0.00 (  0.00%)
Ops swapin-0M                       0.00 (  0.00%)              0.00 (  0.00%)              0.00 (  0.00%)
Ops swapin-715M                   144.00 (  0.00%)              0.00 (  0.00%)              0.00 (  0.00%)
Ops swapin-2385M               134227.00 (  0.00%)              0.00 (  0.00%)              0.00 (  0.00%)
Ops swapin-4055M               125618.00 (  0.00%)              0.00 (  0.00%)              0.00 (  0.00%)
Ops minorfaults-0M            1536429.00 (  0.00%)        1531632.00 (  0.31%)        1533541.00 (  0.19%)
Ops minorfaults-715M          1786996.00 (  0.00%)        1612148.00 (  9.78%)        1608832.00 (  9.97%)
Ops minorfaults-2385M         1757952.00 (  0.00%)        1614874.00 (  8.14%)        1613541.00 (  8.21%)
Ops minorfaults-4055M         1774460.00 (  0.00%)        1633400.00 (  7.95%)        1630881.00 (  8.09%)
Ops majorfaults-0M                  1.00 (  0.00%)              0.00 (  0.00%)              0.00 (  0.00%)
Ops majorfaults-715M              184.00 (  0.00%)            167.00 (  9.24%)            166.00 (  9.78%)
Ops majorfaults-2385M           24444.00 (  0.00%)            155.00 ( 99.37%)             93.00 ( 99.62%)
Ops majorfaults-4055M           21357.00 (  0.00%)            147.00 ( 99.31%)            134.00 ( 99.37%)
memcachetest is the transactions/second reported by memcachetest. In
        the vanilla kernel note that performance drops from around
        23K/sec to just over 4K/second when there is 2385M of IO going
        on in the background. With current mmotm, there is no collapse
	in performance and with this follow-up series there is little
	change.
swaptotal is the total amount of swap traffic. With mmotm and the follow-up
	series, the total amount of swapping is much reduced.
                                 3.9.0       3.9.0       3.9.0
                               vanillamm1-mmotm-20130522mm1-lessdisrupt-v7r10
Minor Faults                  11160152    10706748    10622316
Major Faults                     46305         755         678
Swap Ins                        260249           0           0
Swap Outs                       683860          18          18
Direct pages scanned                 0         678        2520
Kswapd pages scanned           6046108     8814900     1639279
Kswapd pages reclaimed         1081954     1172267     1094635
Direct pages reclaimed               0         566        2304
Kswapd efficiency                  17%         13%         66%
Kswapd velocity               5217.560    7618.953    1414.879
Direct efficiency                 100%         83%         91%
Direct velocity                  0.000       0.586       2.175
Percentage direct scans             0%          0%          0%
Zone normal velocity          5105.086    6824.681     671.158
Zone dma32 velocity            112.473     794.858     745.896
Zone dma velocity                0.000       0.000       0.000
Page writes by reclaim     1929612.000 6861768.000   32821.000
Page writes file               1245752     6861750       32803
Page writes anon                683860          18          18
Page reclaim immediate            7484          40         239
Sector Reads                   1130320       93996       86900
Sector Writes                 13508052    10823500    11804436
Page rescued immediate               0           0           0
Slabs scanned                    33536       27136       18560
Direct inode steals                  0           0           0
Kswapd inode steals               8641        1035           0
Kswapd skipped wait                  0           0           0
THP fault alloc                      8          37          33
THP collapse alloc                 508         552         515
THP splits                          24           1           1
THP fault fallback                   0           0           0
THP collapse fail                    0           0           0
There are a number of observations to make here
1. Swap outs are almost eliminated. Swap ins are 0 indicating that the
   pages swapped were really unused anonymous pages. Related to that,
   major faults are much reduced.
2. kswapd efficiency was impacted by the initial series but with these
   follow-up patches, the efficiency is now at 66% indicating that far
   fewer pages were skipped during scanning due to dirty or writeback
   pages.
3. kswapd velocity is reduced indicating that fewer pages are being scanned
   with the follow-up series as kswapd now stalls when the tail of the
   LRU queue is full of unqueued dirty pages. The stall gives flushers a
   chance to catch-up so kswapd can reclaim clean pages when it wakes
4. In light of Zlatko's recent reports about zone scanning imbalances,
   mmtests now reports scanning velocity on a per-zone basis. With mainline,
   you can see that the scanning activity is dominated by the Normal
   zone with over 45 times more scanning in Normal than the DMA32 zone.
   With the series currently in mmotm, the ratio is slightly better but it
   is still the case that the bulk of scanning is in the highest zone. With
   this follow-up series, the ratio of scanning between the Normal and
   DMA32 zone is roughly equal.
5. As Dave Chinner observed, the current patches in mmotm increased the
   number of pages written from kswapd context which is expected to adversly
   impact IO performance. With the follow-up patches, far fewer pages are
   written from kswapd context than the mainline kernel
6. With the series in mmotm, fewer inodes were reclaimed by kswapd. With
   the follow-up series, there is less slab shrinking activity and no inodes
   were reclaimed.
7. Note that "Sectors Read" is drastically reduced implying that the source
   data being used for the IO is not being aggressively discarded due to
   page reclaim skipping over dirty pages and reclaiming clean pages. Note
   that the reducion in reads could also be due to inode data not being
   re-read from disk after a slab shrink.
                       3.9.0       3.9.0       3.9.0
                     vanillamm1-mmotm-20130522mm1-lessdisrupt-v7r10
Mean sda-avgqz        166.99       32.09       33.44
Mean sda-await        853.64      192.76      185.43
Mean sda-r_await        6.31        9.24        5.97
Mean sda-w_await     2992.81      202.65      192.43
Max  sda-avgqz       1409.91      718.75      698.98
Max  sda-await       6665.74     3538.00     3124.23
Max  sda-r_await       58.96      111.95       58.00
Max  sda-w_await    28458.94     3977.29     3148.61
In light of the changes in writes from reclaim context, the number of
reads and Dave Chinner's concerns about IO performance I took a closer
look at the IO stats for the test disk. Few observations
1. The average queue size is reduced by the initial series and roughly
   the same with this follow up.
2. Average wait times for writes are reduced and as the IO
   is completing faster it at least implies that the gain is because
   flushers are writing the files efficiently instead of page reclaim
   getting in the way.
3. The reduction in maximum write latency is staggering. 28 seconds down
   to 3 seconds.
Jan Kara asked how NFS is affected by all of this. Unstable pages can
be taken into account as one of the patches in the series shows but it
is still the case that filesystems with unusual handling of dirty or
writeback could still be treated better.
Tests like postmark, fsmark and largedd showed up nothing useful. On my test
setup, pages are simply not being written back from reclaim context with or
without the patches and there are no changes in performance. My test setup
probably is just not strong enough network-wise to be really interesting.
I ran a longer-lived memcached test with IO going to NFS instead of a local disk
parallelio
                                             3.9.0                       3.9.0                       3.9.0
                                           vanilla          mm1-mmotm-20130522       mm1-lessdisrupt-v7r10
Ops memcachetest-0M             23323.00 (  0.00%)          23241.00 ( -0.35%)          23321.00 ( -0.01%)
Ops memcachetest-715M           25526.00 (  0.00%)          24763.00 ( -2.99%)          23242.00 ( -8.95%)
Ops memcachetest-2385M           8814.00 (  0.00%)          26924.00 (205.47%)          23521.00 (166.86%)
Ops memcachetest-4055M           5835.00 (  0.00%)          26827.00 (359.76%)          25560.00 (338.05%)
Ops io-duration-0M                  0.00 (  0.00%)              0.00 (  0.00%)              0.00 (  0.00%)
Ops io-duration-715M               65.00 (  0.00%)             71.00 ( -9.23%)             11.00 ( 83.08%)
Ops io-duration-2385M             129.00 (  0.00%)             94.00 ( 27.13%)             53.00 ( 58.91%)
Ops io-duration-4055M             301.00 (  0.00%)            100.00 ( 66.78%)            108.00 ( 64.12%)
Ops swaptotal-0M                    0.00 (  0.00%)              0.00 (  0.00%)              0.00 (  0.00%)
Ops swaptotal-715M              14394.00 (  0.00%)            949.00 ( 93.41%)             63.00 ( 99.56%)
Ops swaptotal-2385M            401483.00 (  0.00%)          24437.00 ( 93.91%)          30118.00 ( 92.50%)
Ops swaptotal-4055M            554123.00 (  0.00%)          35688.00 ( 93.56%)          63082.00 ( 88.62%)
Ops swapin-0M                       0.00 (  0.00%)              0.00 (  0.00%)              0.00 (  0.00%)
Ops swapin-715M                  4522.00 (  0.00%)            560.00 ( 87.62%)             63.00 ( 98.61%)
Ops swapin-2385M               169861.00 (  0.00%)           5026.00 ( 97.04%)          13917.00 ( 91.81%)
Ops swapin-4055M               192374.00 (  0.00%)          10056.00 ( 94.77%)          25729.00 ( 86.63%)
Ops minorfaults-0M            1445969.00 (  0.00%)        1520878.00 ( -5.18%)        1454024.00 ( -0.56%)
Ops minorfaults-715M          1557288.00 (  0.00%)        1528482.00 (  1.85%)        1535776.00 (  1.38%)
Ops minorfaults-2385M         1692896.00 (  0.00%)        1570523.00 (  7.23%)        1559622.00 (  7.87%)
Ops minorfaults-4055M         1654985.00 (  0.00%)        1581456.00 (  4.44%)        1596713.00 (  3.52%)
Ops majorfaults-0M                  0.00 (  0.00%)              1.00 (-99.00%)              0.00 (  0.00%)
Ops majorfaults-715M              763.00 (  0.00%)            265.00 ( 65.27%)             75.00 ( 90.17%)
Ops majorfaults-2385M           23861.00 (  0.00%)            894.00 ( 96.25%)           2189.00 ( 90.83%)
Ops majorfaults-4055M           27210.00 (  0.00%)           1569.00 ( 94.23%)           4088.00 ( 84.98%)
1. Performance does not collapse due to IO which is good. IO is also completing
   faster. Note with mmotm, IO completes in a third of the time and faster again
   with this series applied
2. Swapping is reduced, although not eliminated. The figures for the follow-up
   look bad but it does vary a bit as the stalling is not perfect for nfs
   or filesystems like ext3 with unusual handling of dirty and writeback
   pages
3. There are swapins, particularly with larger amounts of IO indicating
   that active pages are being reclaimed. However, the number of much
   reduced.
                                 3.9.0       3.9.0       3.9.0
                               vanillamm1-mmotm-20130522mm1-lessdisrupt-v7r10
Minor Faults                  36339175    35025445    35219699
Major Faults                    310964       27108       51887
Swap Ins                       2176399      173069      333316
Swap Outs                      3344050      357228      504824
Direct pages scanned              8972       77283       43242
Kswapd pages scanned          20899983     8939566    14772851
Kswapd pages reclaimed         6193156     5172605     5231026
Direct pages reclaimed            8450       73802       39514
Kswapd efficiency                  29%         57%         35%
Kswapd velocity               3929.743    1847.499    3058.840
Direct efficiency                  94%         95%         91%
Direct velocity                  1.687      15.972       8.954
Percentage direct scans             0%          0%          0%
Zone normal velocity          3721.907     939.103    2185.142
Zone dma32 velocity            209.522     924.368     882.651
Zone dma velocity                0.000       0.000       0.000
Page writes by reclaim     4082185.000  526319.000  537114.000
Page writes file                738135      169091       32290
Page writes anon               3344050      357228      504824
Page reclaim immediate            9524         170     5595843
Sector Reads                   8909900      861192     1483680
Sector Writes                 13428980     1488744     2076800
Page rescued immediate               0           0           0
Slabs scanned                    38016       31744       28672
Direct inode steals                  0           0           0
Kswapd inode steals                424           0           0
Kswapd skipped wait                  0           0           0
THP fault alloc                     14          15         119
THP collapse alloc                1767        1569        1618
THP splits                          30          29          25
THP fault fallback                   0           0           0
THP collapse fail                    8           5           0
Compaction stalls                   17          41         100
Compaction success                   7          31          95
Compaction failures                 10          10           5
Page migrate success              7083       22157       62217
Page migrate failure                 0           0           0
Compaction pages isolated        14847       48758      135830
Compaction migrate scanned       18328       48398      138929
Compaction free scanned        2000255      355827     1720269
Compaction cost                      7          24          68
I guess the main takeaway again is the much reduced page writes
from reclaim context and reduced reads.
                       3.9.0       3.9.0       3.9.0
                     vanillamm1-mmotm-20130522mm1-lessdisrupt-v7r10
Mean sda-avgqz         23.58        0.35        0.44
Mean sda-await        133.47       15.72       15.46
Mean sda-r_await        4.72        4.69        3.95
Mean sda-w_await      507.69       28.40       33.68
Max  sda-avgqz        680.60       12.25       23.14
Max  sda-await       3958.89      221.83      286.22
Max  sda-r_await       63.86       61.23       67.29
Max  sda-w_await    11710.38      883.57     1767.28
And as before, write wait times are much reduced.
This patch:
The patch "mm: vmscan: Have kswapd writeback pages based on dirty pages
encountered, not priority" decides whether to writeback pages from reclaim
context based on the number of dirty pages encountered.  This situation is
flagged too easily and flushers are not given the chance to catch up
resulting in more pages being written from reclaim context and potentially
impacting IO performance.  The check for PageWriteback is also misplaced
as it happens within a PageDirty check which is nonsense as the dirty may
have been cleared for IO.  The accounting is updated very late and pages
that are already under writeback, were reactivated, could not unmapped or
could not be released are all missed.  Similarly, a page is considered
congested for reasons other than being congested and pages that cannot be
written out in the correct context are skipped.  Finally, it considers
stalling and writing back filesystem pages due to encountering dirty
anonymous pages at the tail of the LRU which is dumb.
This patch causes kswapd to begin writing filesystem pages from reclaim
context only if page reclaim found that all filesystem pages at the tail
of the LRU were unqueued dirty pages.  Before it starts writing filesystem
pages, it will stall to give flushers a chance to catch up.  The decision
on whether wait_iff_congested is also now determined by dirty filesystem
pages only.  Congested pages are based on whether the underlying BDI is
congested regardless of the context of the reclaiming process.
Signed-off-by: Mel Gorman <mgorman@suse.de>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Michal Hocko <mhocko@suse.cz>
Cc: Rik van Riel <riel@redhat.com>
Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Jiri Slaby <jslaby@suse.cz>
Cc: Valdis Kletnieks <Valdis.Kletnieks@vt.edu>
Cc: Zlatko Calusic <zcalusic@bitsync.net>
Cc: dormando <dormando@rydia.net>
Cc: Trond Myklebust <trond.myklebust@fys.uio.no>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
											
										 
											2013-07-03 15:01:57 -07:00
										 |  |  | 		bool dirty, writeback; | 
					
						
							| 
									
										
										
										
											2005-04-16 15:20:36 -07:00
										 |  |  | 
 | 
					
						
							|  |  |  | 		cond_resched(); | 
					
						
							|  |  |  | 
 | 
					
						
							|  |  |  | 		page = lru_to_page(page_list); | 
					
						
							|  |  |  | 		list_del(&page->lru); | 
					
						
							|  |  |  | 
 | 
					
						
							| 
									
										
										
										
											2008-08-02 12:01:03 +02:00
										 |  |  | 		if (!trylock_page(page)) | 
					
						
							| 
									
										
										
										
											2005-04-16 15:20:36 -07:00
										 |  |  | 			goto keep; | 
					
						
							|  |  |  | 
 | 
					
						
							| 
									
										
										
										
											2006-09-25 23:30:55 -07:00
										 |  |  | 		VM_BUG_ON(PageActive(page)); | 
					
						
							| 
									
										
										
										
											2012-05-29 15:06:59 -07:00
										 |  |  | 		VM_BUG_ON(page_zone(page) != zone); | 
					
						
							| 
									
										
										
										
											2005-04-16 15:20:36 -07:00
										 |  |  | 
 | 
					
						
							|  |  |  | 		sc->nr_scanned++; | 
					
						
							| 
									
										
										
										
											2006-02-11 17:55:53 -08:00
										 |  |  | 
 | 
					
						
							| 
									
										
										
										
											2012-10-08 16:33:18 -07:00
										 |  |  | 		if (unlikely(!page_evictable(page))) | 
					
						
							| 
									
										
											  
											
												mlock: mlocked pages are unevictable
Make sure that mlocked pages also live on the unevictable LRU, so kswapd
will not scan them over and over again.
This is achieved through various strategies:
1) add yet another page flag--PG_mlocked--to indicate that
   the page is locked for efficient testing in vmscan and,
   optionally, fault path.  This allows early culling of
   unevictable pages, preventing them from getting to
   page_referenced()/try_to_unmap().  Also allows separate
   accounting of mlock'd pages, as Nick's original patch
   did.
   Note:  Nick's original mlock patch used a PG_mlocked
   flag.  I had removed this in favor of the PG_unevictable
   flag + an mlock_count [new page struct member].  I
   restored the PG_mlocked flag to eliminate the new
   count field.
2) add the mlock/unevictable infrastructure to mm/mlock.c,
   with internal APIs in mm/internal.h.  This is a rework
   of Nick's original patch to these files, taking into
   account that mlocked pages are now kept on unevictable
   LRU list.
3) update vmscan.c:page_evictable() to check PageMlocked()
   and, if vma passed in, the vm_flags.  Note that the vma
   will only be passed in for new pages in the fault path;
   and then only if the "cull unevictable pages in fault
   path" patch is included.
4) add try_to_unlock() to rmap.c to walk a page's rmap and
   ClearPageMlocked() if no other vmas have it mlocked.
   Reuses as much of try_to_unmap() as possible.  This
   effectively replaces the use of one of the lru list links
   as an mlock count.  If this mechanism let's pages in mlocked
   vmas leak through w/o PG_mlocked set [I don't know that it
   does], we should catch them later in try_to_unmap().  One
   hopes this will be rare, as it will be relatively expensive.
Original mm/internal.h, mm/rmap.c and mm/mlock.c changes:
Signed-off-by: Nick Piggin <npiggin@suse.de>
splitlru: introduce __get_user_pages():
  New munlock processing need to GUP_FLAGS_IGNORE_VMA_PERMISSIONS.
  because current get_user_pages() can't grab PROT_NONE pages theresore it
  cause PROT_NONE pages can't munlock.
[akpm@linux-foundation.org: fix this for pagemap-pass-mm-into-pagewalkers.patch]
[akpm@linux-foundation.org: untangle patch interdependencies]
[akpm@linux-foundation.org: fix things after out-of-order merging]
[hugh@veritas.com: fix page-flags mess]
[lee.schermerhorn@hp.com: fix munlock page table walk - now requires 'mm']
[kosaki.motohiro@jp.fujitsu.com: build fix]
[kosaki.motohiro@jp.fujitsu.com: fix truncate race and sevaral comments]
[kosaki.motohiro@jp.fujitsu.com: splitlru: introduce __get_user_pages()]
Signed-off-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Signed-off-by: Rik van Riel <riel@redhat.com>
Signed-off-by: Lee Schermerhorn <lee.schermerhorn@hp.com>
Cc: Nick Piggin <npiggin@suse.de>
Cc: Dave Hansen <dave@linux.vnet.ibm.com>
Cc: Matt Mackall <mpm@selenic.com>
Signed-off-by: Hugh Dickins <hugh@veritas.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
											
										 
											2008-10-18 20:26:44 -07:00
										 |  |  | 			goto cull_mlocked; | 
					
						
							| 
									
										
											  
											
												Unevictable LRU Infrastructure
When the system contains lots of mlocked or otherwise unevictable pages,
the pageout code (kswapd) can spend lots of time scanning over these
pages.  Worse still, the presence of lots of unevictable pages can confuse
kswapd into thinking that more aggressive pageout modes are required,
resulting in all kinds of bad behaviour.
Infrastructure to manage pages excluded from reclaim--i.e., hidden from
vmscan.  Based on a patch by Larry Woodman of Red Hat.  Reworked to
maintain "unevictable" pages on a separate per-zone LRU list, to "hide"
them from vmscan.
Kosaki Motohiro added the support for the memory controller unevictable
lru list.
Pages on the unevictable list have both PG_unevictable and PG_lru set.
Thus, PG_unevictable is analogous to and mutually exclusive with
PG_active--it specifies which LRU list the page is on.
The unevictable infrastructure is enabled by a new mm Kconfig option
[CONFIG_]UNEVICTABLE_LRU.
A new function 'page_evictable(page, vma)' in vmscan.c tests whether or
not a page may be evictable.  Subsequent patches will add the various
!evictable tests.  We'll want to keep these tests light-weight for use in
shrink_active_list() and, possibly, the fault path.
To avoid races between tasks putting pages [back] onto an LRU list and
tasks that might be moving the page from non-evictable to evictable state,
the new function 'putback_lru_page()' -- inverse to 'isolate_lru_page()'
-- tests the "evictability" of a page after placing it on the LRU, before
dropping the reference.  If the page has become unevictable,
putback_lru_page() will redo the 'putback', thus moving the page to the
unevictable list.  This way, we avoid "stranding" evictable pages on the
unevictable list.
[akpm@linux-foundation.org: fix fallout from out-of-order merge]
[riel@redhat.com: fix UNEVICTABLE_LRU and !PROC_PAGE_MONITOR build]
[nishimura@mxp.nes.nec.co.jp: remove redundant mapping check]
[kosaki.motohiro@jp.fujitsu.com: unevictable-lru-infrastructure: putback_lru_page()/unevictable page handling rework]
[kosaki.motohiro@jp.fujitsu.com: kill unnecessary lock_page() in vmscan.c]
[kosaki.motohiro@jp.fujitsu.com: revert migration change of unevictable lru infrastructure]
[kosaki.motohiro@jp.fujitsu.com: revert to unevictable-lru-infrastructure-kconfig-fix.patch]
[kosaki.motohiro@jp.fujitsu.com: restore patch failure of vmstat-unevictable-and-mlocked-pages-vm-events.patch]
Signed-off-by: Lee Schermerhorn <lee.schermerhorn@hp.com>
Signed-off-by: Rik van Riel <riel@redhat.com>
Signed-off-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Debugged-by: Benjamin Kidwell <benjkidwell@yahoo.com>
Signed-off-by: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp>
Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
											
										 
											2008-10-18 20:26:39 -07:00
										 |  |  | 
 | 
					
						
							| 
									
										
										
										
											2009-03-31 15:19:30 -07:00
										 |  |  | 		if (!sc->may_unmap && page_mapped(page)) | 
					
						
							| 
									
										
										
										
											2006-02-11 17:55:53 -08:00
										 |  |  | 			goto keep_locked; | 
					
						
							|  |  |  | 
 | 
					
						
							| 
									
										
										
										
											2005-04-16 15:20:36 -07:00
										 |  |  | 		/* Double the slab pressure for mapped and swapcache pages */ | 
					
						
							|  |  |  | 		if (page_mapped(page) || PageSwapCache(page)) | 
					
						
							|  |  |  | 			sc->nr_scanned++; | 
					
						
							|  |  |  | 
 | 
					
						
							| 
									
										
										
										
											2007-08-22 14:01:26 -07:00
										 |  |  | 		may_enter_fs = (sc->gfp_mask & __GFP_FS) || | 
					
						
							|  |  |  | 			(PageSwapCache(page) && (sc->gfp_mask & __GFP_IO)); | 
					
						
							|  |  |  | 
 | 
					
						
							| 
									
										
											  
											
												mm: vmscan: stall page reclaim and writeback pages based on dirty/writepage pages encountered
Further testing of the "Reduce system disruption due to kswapd"
discovered a few problems.  First and foremost, it's possible for pages
under writeback to be freed which will lead to badness.  Second, as
pages were not being swapped the file LRU was being scanned faster and
clean file pages were being reclaimed.  In some cases this results in
increased read IO to re-read data from disk.  Third, more pages were
being written from kswapd context which can adversly affect IO
performance.  Lastly, it was observed that PageDirty pages are not
necessarily dirty on all filesystems (buffers can be clean while
PageDirty is set and ->writepage generates no IO) and not all
filesystems set PageWriteback when the page is being written (e.g.
ext3).  This disconnect confuses the reclaim stalling logic.  This
follow-up series is aimed at these problems.
The tests were based on three kernels
vanilla:	kernel 3.9 as that is what the current mmotm uses as a baseline
mmotm-20130522	is mmotm as of 22nd May with "Reduce system disruption due to
		kswapd" applied on top as per what should be in Andrew's tree
		right now
lessdisrupt-v7r10 is this follow-up series on top of the mmotm kernel
The first test used memcached+memcachetest while some background IO was
in progress as implemented by the parallel IO tests implement in MM
Tests.  memcachetest benchmarks how many operations/second memcached can
service.  It starts with no background IO on a freshly created ext4
filesystem and then re-runs the test with larger amounts of IO in the
background to roughly simulate a large copy in progress.  The
expectation is that the IO should have little or no impact on
memcachetest which is running entirely in memory.
parallelio
                                             3.9.0                       3.9.0                       3.9.0
                                           vanilla          mm1-mmotm-20130522       mm1-lessdisrupt-v7r10
Ops memcachetest-0M             23117.00 (  0.00%)          22780.00 ( -1.46%)          22763.00 ( -1.53%)
Ops memcachetest-715M           23774.00 (  0.00%)          23299.00 ( -2.00%)          22934.00 ( -3.53%)
Ops memcachetest-2385M           4208.00 (  0.00%)          24154.00 (474.00%)          23765.00 (464.76%)
Ops memcachetest-4055M           4104.00 (  0.00%)          25130.00 (512.33%)          24614.00 (499.76%)
Ops io-duration-0M                  0.00 (  0.00%)              0.00 (  0.00%)              0.00 (  0.00%)
Ops io-duration-715M               12.00 (  0.00%)              7.00 ( 41.67%)              6.00 ( 50.00%)
Ops io-duration-2385M             116.00 (  0.00%)             21.00 ( 81.90%)             21.00 ( 81.90%)
Ops io-duration-4055M             160.00 (  0.00%)             36.00 ( 77.50%)             35.00 ( 78.12%)
Ops swaptotal-0M                    0.00 (  0.00%)              0.00 (  0.00%)              0.00 (  0.00%)
Ops swaptotal-715M             140138.00 (  0.00%)             18.00 ( 99.99%)             18.00 ( 99.99%)
Ops swaptotal-2385M            385682.00 (  0.00%)              0.00 (  0.00%)              0.00 (  0.00%)
Ops swaptotal-4055M            418029.00 (  0.00%)              0.00 (  0.00%)              0.00 (  0.00%)
Ops swapin-0M                       0.00 (  0.00%)              0.00 (  0.00%)              0.00 (  0.00%)
Ops swapin-715M                   144.00 (  0.00%)              0.00 (  0.00%)              0.00 (  0.00%)
Ops swapin-2385M               134227.00 (  0.00%)              0.00 (  0.00%)              0.00 (  0.00%)
Ops swapin-4055M               125618.00 (  0.00%)              0.00 (  0.00%)              0.00 (  0.00%)
Ops minorfaults-0M            1536429.00 (  0.00%)        1531632.00 (  0.31%)        1533541.00 (  0.19%)
Ops minorfaults-715M          1786996.00 (  0.00%)        1612148.00 (  9.78%)        1608832.00 (  9.97%)
Ops minorfaults-2385M         1757952.00 (  0.00%)        1614874.00 (  8.14%)        1613541.00 (  8.21%)
Ops minorfaults-4055M         1774460.00 (  0.00%)        1633400.00 (  7.95%)        1630881.00 (  8.09%)
Ops majorfaults-0M                  1.00 (  0.00%)              0.00 (  0.00%)              0.00 (  0.00%)
Ops majorfaults-715M              184.00 (  0.00%)            167.00 (  9.24%)            166.00 (  9.78%)
Ops majorfaults-2385M           24444.00 (  0.00%)            155.00 ( 99.37%)             93.00 ( 99.62%)
Ops majorfaults-4055M           21357.00 (  0.00%)            147.00 ( 99.31%)            134.00 ( 99.37%)
memcachetest is the transactions/second reported by memcachetest. In
        the vanilla kernel note that performance drops from around
        23K/sec to just over 4K/second when there is 2385M of IO going
        on in the background. With current mmotm, there is no collapse
	in performance and with this follow-up series there is little
	change.
swaptotal is the total amount of swap traffic. With mmotm and the follow-up
	series, the total amount of swapping is much reduced.
                                 3.9.0       3.9.0       3.9.0
                               vanillamm1-mmotm-20130522mm1-lessdisrupt-v7r10
Minor Faults                  11160152    10706748    10622316
Major Faults                     46305         755         678
Swap Ins                        260249           0           0
Swap Outs                       683860          18          18
Direct pages scanned                 0         678        2520
Kswapd pages scanned           6046108     8814900     1639279
Kswapd pages reclaimed         1081954     1172267     1094635
Direct pages reclaimed               0         566        2304
Kswapd efficiency                  17%         13%         66%
Kswapd velocity               5217.560    7618.953    1414.879
Direct efficiency                 100%         83%         91%
Direct velocity                  0.000       0.586       2.175
Percentage direct scans             0%          0%          0%
Zone normal velocity          5105.086    6824.681     671.158
Zone dma32 velocity            112.473     794.858     745.896
Zone dma velocity                0.000       0.000       0.000
Page writes by reclaim     1929612.000 6861768.000   32821.000
Page writes file               1245752     6861750       32803
Page writes anon                683860          18          18
Page reclaim immediate            7484          40         239
Sector Reads                   1130320       93996       86900
Sector Writes                 13508052    10823500    11804436
Page rescued immediate               0           0           0
Slabs scanned                    33536       27136       18560
Direct inode steals                  0           0           0
Kswapd inode steals               8641        1035           0
Kswapd skipped wait                  0           0           0
THP fault alloc                      8          37          33
THP collapse alloc                 508         552         515
THP splits                          24           1           1
THP fault fallback                   0           0           0
THP collapse fail                    0           0           0
There are a number of observations to make here
1. Swap outs are almost eliminated. Swap ins are 0 indicating that the
   pages swapped were really unused anonymous pages. Related to that,
   major faults are much reduced.
2. kswapd efficiency was impacted by the initial series but with these
   follow-up patches, the efficiency is now at 66% indicating that far
   fewer pages were skipped during scanning due to dirty or writeback
   pages.
3. kswapd velocity is reduced indicating that fewer pages are being scanned
   with the follow-up series as kswapd now stalls when the tail of the
   LRU queue is full of unqueued dirty pages. The stall gives flushers a
   chance to catch-up so kswapd can reclaim clean pages when it wakes
4. In light of Zlatko's recent reports about zone scanning imbalances,
   mmtests now reports scanning velocity on a per-zone basis. With mainline,
   you can see that the scanning activity is dominated by the Normal
   zone with over 45 times more scanning in Normal than the DMA32 zone.
   With the series currently in mmotm, the ratio is slightly better but it
   is still the case that the bulk of scanning is in the highest zone. With
   this follow-up series, the ratio of scanning between the Normal and
   DMA32 zone is roughly equal.
5. As Dave Chinner observed, the current patches in mmotm increased the
   number of pages written from kswapd context which is expected to adversly
   impact IO performance. With the follow-up patches, far fewer pages are
   written from kswapd context than the mainline kernel
6. With the series in mmotm, fewer inodes were reclaimed by kswapd. With
   the follow-up series, there is less slab shrinking activity and no inodes
   were reclaimed.
7. Note that "Sectors Read" is drastically reduced implying that the source
   data being used for the IO is not being aggressively discarded due to
   page reclaim skipping over dirty pages and reclaiming clean pages. Note
   that the reducion in reads could also be due to inode data not being
   re-read from disk after a slab shrink.
                       3.9.0       3.9.0       3.9.0
                     vanillamm1-mmotm-20130522mm1-lessdisrupt-v7r10
Mean sda-avgqz        166.99       32.09       33.44
Mean sda-await        853.64      192.76      185.43
Mean sda-r_await        6.31        9.24        5.97
Mean sda-w_await     2992.81      202.65      192.43
Max  sda-avgqz       1409.91      718.75      698.98
Max  sda-await       6665.74     3538.00     3124.23
Max  sda-r_await       58.96      111.95       58.00
Max  sda-w_await    28458.94     3977.29     3148.61
In light of the changes in writes from reclaim context, the number of
reads and Dave Chinner's concerns about IO performance I took a closer
look at the IO stats for the test disk. Few observations
1. The average queue size is reduced by the initial series and roughly
   the same with this follow up.
2. Average wait times for writes are reduced and as the IO
   is completing faster it at least implies that the gain is because
   flushers are writing the files efficiently instead of page reclaim
   getting in the way.
3. The reduction in maximum write latency is staggering. 28 seconds down
   to 3 seconds.
Jan Kara asked how NFS is affected by all of this. Unstable pages can
be taken into account as one of the patches in the series shows but it
is still the case that filesystems with unusual handling of dirty or
writeback could still be treated better.
Tests like postmark, fsmark and largedd showed up nothing useful. On my test
setup, pages are simply not being written back from reclaim context with or
without the patches and there are no changes in performance. My test setup
probably is just not strong enough network-wise to be really interesting.
I ran a longer-lived memcached test with IO going to NFS instead of a local disk
parallelio
                                             3.9.0                       3.9.0                       3.9.0
                                           vanilla          mm1-mmotm-20130522       mm1-lessdisrupt-v7r10
Ops memcachetest-0M             23323.00 (  0.00%)          23241.00 ( -0.35%)          23321.00 ( -0.01%)
Ops memcachetest-715M           25526.00 (  0.00%)          24763.00 ( -2.99%)          23242.00 ( -8.95%)
Ops memcachetest-2385M           8814.00 (  0.00%)          26924.00 (205.47%)          23521.00 (166.86%)
Ops memcachetest-4055M           5835.00 (  0.00%)          26827.00 (359.76%)          25560.00 (338.05%)
Ops io-duration-0M                  0.00 (  0.00%)              0.00 (  0.00%)              0.00 (  0.00%)
Ops io-duration-715M               65.00 (  0.00%)             71.00 ( -9.23%)             11.00 ( 83.08%)
Ops io-duration-2385M             129.00 (  0.00%)             94.00 ( 27.13%)             53.00 ( 58.91%)
Ops io-duration-4055M             301.00 (  0.00%)            100.00 ( 66.78%)            108.00 ( 64.12%)
Ops swaptotal-0M                    0.00 (  0.00%)              0.00 (  0.00%)              0.00 (  0.00%)
Ops swaptotal-715M              14394.00 (  0.00%)            949.00 ( 93.41%)             63.00 ( 99.56%)
Ops swaptotal-2385M            401483.00 (  0.00%)          24437.00 ( 93.91%)          30118.00 ( 92.50%)
Ops swaptotal-4055M            554123.00 (  0.00%)          35688.00 ( 93.56%)          63082.00 ( 88.62%)
Ops swapin-0M                       0.00 (  0.00%)              0.00 (  0.00%)              0.00 (  0.00%)
Ops swapin-715M                  4522.00 (  0.00%)            560.00 ( 87.62%)             63.00 ( 98.61%)
Ops swapin-2385M               169861.00 (  0.00%)           5026.00 ( 97.04%)          13917.00 ( 91.81%)
Ops swapin-4055M               192374.00 (  0.00%)          10056.00 ( 94.77%)          25729.00 ( 86.63%)
Ops minorfaults-0M            1445969.00 (  0.00%)        1520878.00 ( -5.18%)        1454024.00 ( -0.56%)
Ops minorfaults-715M          1557288.00 (  0.00%)        1528482.00 (  1.85%)        1535776.00 (  1.38%)
Ops minorfaults-2385M         1692896.00 (  0.00%)        1570523.00 (  7.23%)        1559622.00 (  7.87%)
Ops minorfaults-4055M         1654985.00 (  0.00%)        1581456.00 (  4.44%)        1596713.00 (  3.52%)
Ops majorfaults-0M                  0.00 (  0.00%)              1.00 (-99.00%)              0.00 (  0.00%)
Ops majorfaults-715M              763.00 (  0.00%)            265.00 ( 65.27%)             75.00 ( 90.17%)
Ops majorfaults-2385M           23861.00 (  0.00%)            894.00 ( 96.25%)           2189.00 ( 90.83%)
Ops majorfaults-4055M           27210.00 (  0.00%)           1569.00 ( 94.23%)           4088.00 ( 84.98%)
1. Performance does not collapse due to IO which is good. IO is also completing
   faster. Note with mmotm, IO completes in a third of the time and faster again
   with this series applied
2. Swapping is reduced, although not eliminated. The figures for the follow-up
   look bad but it does vary a bit as the stalling is not perfect for nfs
   or filesystems like ext3 with unusual handling of dirty and writeback
   pages
3. There are swapins, particularly with larger amounts of IO indicating
   that active pages are being reclaimed. However, the number of much
   reduced.
                                 3.9.0       3.9.0       3.9.0
                               vanillamm1-mmotm-20130522mm1-lessdisrupt-v7r10
Minor Faults                  36339175    35025445    35219699
Major Faults                    310964       27108       51887
Swap Ins                       2176399      173069      333316
Swap Outs                      3344050      357228      504824
Direct pages scanned              8972       77283       43242
Kswapd pages scanned          20899983     8939566    14772851
Kswapd pages reclaimed         6193156     5172605     5231026
Direct pages reclaimed            8450       73802       39514
Kswapd efficiency                  29%         57%         35%
Kswapd velocity               3929.743    1847.499    3058.840
Direct efficiency                  94%         95%         91%
Direct velocity                  1.687      15.972       8.954
Percentage direct scans             0%          0%          0%
Zone normal velocity          3721.907     939.103    2185.142
Zone dma32 velocity            209.522     924.368     882.651
Zone dma velocity                0.000       0.000       0.000
Page writes by reclaim     4082185.000  526319.000  537114.000
Page writes file                738135      169091       32290
Page writes anon               3344050      357228      504824
Page reclaim immediate            9524         170     5595843
Sector Reads                   8909900      861192     1483680
Sector Writes                 13428980     1488744     2076800
Page rescued immediate               0           0           0
Slabs scanned                    38016       31744       28672
Direct inode steals                  0           0           0
Kswapd inode steals                424           0           0
Kswapd skipped wait                  0           0           0
THP fault alloc                     14          15         119
THP collapse alloc                1767        1569        1618
THP splits                          30          29          25
THP fault fallback                   0           0           0
THP collapse fail                    8           5           0
Compaction stalls                   17          41         100
Compaction success                   7          31          95
Compaction failures                 10          10           5
Page migrate success              7083       22157       62217
Page migrate failure                 0           0           0
Compaction pages isolated        14847       48758      135830
Compaction migrate scanned       18328       48398      138929
Compaction free scanned        2000255      355827     1720269
Compaction cost                      7          24          68
I guess the main takeaway again is the much reduced page writes
from reclaim context and reduced reads.
                       3.9.0       3.9.0       3.9.0
                     vanillamm1-mmotm-20130522mm1-lessdisrupt-v7r10
Mean sda-avgqz         23.58        0.35        0.44
Mean sda-await        133.47       15.72       15.46
Mean sda-r_await        4.72        4.69        3.95
Mean sda-w_await      507.69       28.40       33.68
Max  sda-avgqz        680.60       12.25       23.14
Max  sda-await       3958.89      221.83      286.22
Max  sda-r_await       63.86       61.23       67.29
Max  sda-w_await    11710.38      883.57     1767.28
And as before, write wait times are much reduced.
This patch:
The patch "mm: vmscan: Have kswapd writeback pages based on dirty pages
encountered, not priority" decides whether to writeback pages from reclaim
context based on the number of dirty pages encountered.  This situation is
flagged too easily and flushers are not given the chance to catch up
resulting in more pages being written from reclaim context and potentially
impacting IO performance.  The check for PageWriteback is also misplaced
as it happens within a PageDirty check which is nonsense as the dirty may
have been cleared for IO.  The accounting is updated very late and pages
that are already under writeback, were reactivated, could not unmapped or
could not be released are all missed.  Similarly, a page is considered
congested for reasons other than being congested and pages that cannot be
written out in the correct context are skipped.  Finally, it considers
stalling and writing back filesystem pages due to encountering dirty
anonymous pages at the tail of the LRU which is dumb.
This patch causes kswapd to begin writing filesystem pages from reclaim
context only if page reclaim found that all filesystem pages at the tail
of the LRU were unqueued dirty pages.  Before it starts writing filesystem
pages, it will stall to give flushers a chance to catch up.  The decision
on whether wait_iff_congested is also now determined by dirty filesystem
pages only.  Congested pages are based on whether the underlying BDI is
congested regardless of the context of the reclaiming process.
Signed-off-by: Mel Gorman <mgorman@suse.de>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Michal Hocko <mhocko@suse.cz>
Cc: Rik van Riel <riel@redhat.com>
Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Jiri Slaby <jslaby@suse.cz>
Cc: Valdis Kletnieks <Valdis.Kletnieks@vt.edu>
Cc: Zlatko Calusic <zcalusic@bitsync.net>
Cc: dormando <dormando@rydia.net>
Cc: Trond Myklebust <trond.myklebust@fys.uio.no>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
											
										 
											2013-07-03 15:01:57 -07:00
										 |  |  | 		/*
 | 
					
						
							|  |  |  | 		 * The number of dirty pages determines if a zone is marked | 
					
						
							|  |  |  | 		 * reclaim_congested which affects wait_iff_congested. kswapd | 
					
						
							|  |  |  | 		 * will stall and start writing pages if the tail of the LRU | 
					
						
							|  |  |  | 		 * is all dirty unqueued pages. | 
					
						
							|  |  |  | 		 */ | 
					
						
							|  |  |  | 		page_check_dirty_writeback(page, &dirty, &writeback); | 
					
						
							|  |  |  | 		if (dirty || writeback) | 
					
						
							|  |  |  | 			nr_dirty++; | 
					
						
							|  |  |  | 
 | 
					
						
							|  |  |  | 		if (dirty && !writeback) | 
					
						
							|  |  |  | 			nr_unqueued_dirty++; | 
					
						
							|  |  |  | 
 | 
					
						
							| 
									
										
										
										
											2013-07-03 15:02:03 -07:00
										 |  |  | 		/*
 | 
					
						
							|  |  |  | 		 * Treat this page as congested if the underlying BDI is or if | 
					
						
							|  |  |  | 		 * pages are cycling through the LRU so quickly that the | 
					
						
							|  |  |  | 		 * pages marked for immediate reclaim are making it to the | 
					
						
							|  |  |  | 		 * end of the LRU a second time. | 
					
						
							|  |  |  | 		 */ | 
					
						
							| 
									
										
											  
											
												mm: vmscan: stall page reclaim and writeback pages based on dirty/writepage pages encountered
Further testing of the "Reduce system disruption due to kswapd"
discovered a few problems.  First and foremost, it's possible for pages
under writeback to be freed which will lead to badness.  Second, as
pages were not being swapped the file LRU was being scanned faster and
clean file pages were being reclaimed.  In some cases this results in
increased read IO to re-read data from disk.  Third, more pages were
being written from kswapd context which can adversly affect IO
performance.  Lastly, it was observed that PageDirty pages are not
necessarily dirty on all filesystems (buffers can be clean while
PageDirty is set and ->writepage generates no IO) and not all
filesystems set PageWriteback when the page is being written (e.g.
ext3).  This disconnect confuses the reclaim stalling logic.  This
follow-up series is aimed at these problems.
The tests were based on three kernels
vanilla:	kernel 3.9 as that is what the current mmotm uses as a baseline
mmotm-20130522	is mmotm as of 22nd May with "Reduce system disruption due to
		kswapd" applied on top as per what should be in Andrew's tree
		right now
lessdisrupt-v7r10 is this follow-up series on top of the mmotm kernel
The first test used memcached+memcachetest while some background IO was
in progress as implemented by the parallel IO tests implement in MM
Tests.  memcachetest benchmarks how many operations/second memcached can
service.  It starts with no background IO on a freshly created ext4
filesystem and then re-runs the test with larger amounts of IO in the
background to roughly simulate a large copy in progress.  The
expectation is that the IO should have little or no impact on
memcachetest which is running entirely in memory.
parallelio
                                             3.9.0                       3.9.0                       3.9.0
                                           vanilla          mm1-mmotm-20130522       mm1-lessdisrupt-v7r10
Ops memcachetest-0M             23117.00 (  0.00%)          22780.00 ( -1.46%)          22763.00 ( -1.53%)
Ops memcachetest-715M           23774.00 (  0.00%)          23299.00 ( -2.00%)          22934.00 ( -3.53%)
Ops memcachetest-2385M           4208.00 (  0.00%)          24154.00 (474.00%)          23765.00 (464.76%)
Ops memcachetest-4055M           4104.00 (  0.00%)          25130.00 (512.33%)          24614.00 (499.76%)
Ops io-duration-0M                  0.00 (  0.00%)              0.00 (  0.00%)              0.00 (  0.00%)
Ops io-duration-715M               12.00 (  0.00%)              7.00 ( 41.67%)              6.00 ( 50.00%)
Ops io-duration-2385M             116.00 (  0.00%)             21.00 ( 81.90%)             21.00 ( 81.90%)
Ops io-duration-4055M             160.00 (  0.00%)             36.00 ( 77.50%)             35.00 ( 78.12%)
Ops swaptotal-0M                    0.00 (  0.00%)              0.00 (  0.00%)              0.00 (  0.00%)
Ops swaptotal-715M             140138.00 (  0.00%)             18.00 ( 99.99%)             18.00 ( 99.99%)
Ops swaptotal-2385M            385682.00 (  0.00%)              0.00 (  0.00%)              0.00 (  0.00%)
Ops swaptotal-4055M            418029.00 (  0.00%)              0.00 (  0.00%)              0.00 (  0.00%)
Ops swapin-0M                       0.00 (  0.00%)              0.00 (  0.00%)              0.00 (  0.00%)
Ops swapin-715M                   144.00 (  0.00%)              0.00 (  0.00%)              0.00 (  0.00%)
Ops swapin-2385M               134227.00 (  0.00%)              0.00 (  0.00%)              0.00 (  0.00%)
Ops swapin-4055M               125618.00 (  0.00%)              0.00 (  0.00%)              0.00 (  0.00%)
Ops minorfaults-0M            1536429.00 (  0.00%)        1531632.00 (  0.31%)        1533541.00 (  0.19%)
Ops minorfaults-715M          1786996.00 (  0.00%)        1612148.00 (  9.78%)        1608832.00 (  9.97%)
Ops minorfaults-2385M         1757952.00 (  0.00%)        1614874.00 (  8.14%)        1613541.00 (  8.21%)
Ops minorfaults-4055M         1774460.00 (  0.00%)        1633400.00 (  7.95%)        1630881.00 (  8.09%)
Ops majorfaults-0M                  1.00 (  0.00%)              0.00 (  0.00%)              0.00 (  0.00%)
Ops majorfaults-715M              184.00 (  0.00%)            167.00 (  9.24%)            166.00 (  9.78%)
Ops majorfaults-2385M           24444.00 (  0.00%)            155.00 ( 99.37%)             93.00 ( 99.62%)
Ops majorfaults-4055M           21357.00 (  0.00%)            147.00 ( 99.31%)            134.00 ( 99.37%)
memcachetest is the transactions/second reported by memcachetest. In
        the vanilla kernel note that performance drops from around
        23K/sec to just over 4K/second when there is 2385M of IO going
        on in the background. With current mmotm, there is no collapse
	in performance and with this follow-up series there is little
	change.
swaptotal is the total amount of swap traffic. With mmotm and the follow-up
	series, the total amount of swapping is much reduced.
                                 3.9.0       3.9.0       3.9.0
                               vanillamm1-mmotm-20130522mm1-lessdisrupt-v7r10
Minor Faults                  11160152    10706748    10622316
Major Faults                     46305         755         678
Swap Ins                        260249           0           0
Swap Outs                       683860          18          18
Direct pages scanned                 0         678        2520
Kswapd pages scanned           6046108     8814900     1639279
Kswapd pages reclaimed         1081954     1172267     1094635
Direct pages reclaimed               0         566        2304
Kswapd efficiency                  17%         13%         66%
Kswapd velocity               5217.560    7618.953    1414.879
Direct efficiency                 100%         83%         91%
Direct velocity                  0.000       0.586       2.175
Percentage direct scans             0%          0%          0%
Zone normal velocity          5105.086    6824.681     671.158
Zone dma32 velocity            112.473     794.858     745.896
Zone dma velocity                0.000       0.000       0.000
Page writes by reclaim     1929612.000 6861768.000   32821.000
Page writes file               1245752     6861750       32803
Page writes anon                683860          18          18
Page reclaim immediate            7484          40         239
Sector Reads                   1130320       93996       86900
Sector Writes                 13508052    10823500    11804436
Page rescued immediate               0           0           0
Slabs scanned                    33536       27136       18560
Direct inode steals                  0           0           0
Kswapd inode steals               8641        1035           0
Kswapd skipped wait                  0           0           0
THP fault alloc                      8          37          33
THP collapse alloc                 508         552         515
THP splits                          24           1           1
THP fault fallback                   0           0           0
THP collapse fail                    0           0           0
There are a number of observations to make here
1. Swap outs are almost eliminated. Swap ins are 0 indicating that the
   pages swapped were really unused anonymous pages. Related to that,
   major faults are much reduced.
2. kswapd efficiency was impacted by the initial series but with these
   follow-up patches, the efficiency is now at 66% indicating that far
   fewer pages were skipped during scanning due to dirty or writeback
   pages.
3. kswapd velocity is reduced indicating that fewer pages are being scanned
   with the follow-up series as kswapd now stalls when the tail of the
   LRU queue is full of unqueued dirty pages. The stall gives flushers a
   chance to catch-up so kswapd can reclaim clean pages when it wakes
4. In light of Zlatko's recent reports about zone scanning imbalances,
   mmtests now reports scanning velocity on a per-zone basis. With mainline,
   you can see that the scanning activity is dominated by the Normal
   zone with over 45 times more scanning in Normal than the DMA32 zone.
   With the series currently in mmotm, the ratio is slightly better but it
   is still the case that the bulk of scanning is in the highest zone. With
   this follow-up series, the ratio of scanning between the Normal and
   DMA32 zone is roughly equal.
5. As Dave Chinner observed, the current patches in mmotm increased the
   number of pages written from kswapd context which is expected to adversly
   impact IO performance. With the follow-up patches, far fewer pages are
   written from kswapd context than the mainline kernel
6. With the series in mmotm, fewer inodes were reclaimed by kswapd. With
   the follow-up series, there is less slab shrinking activity and no inodes
   were reclaimed.
7. Note that "Sectors Read" is drastically reduced implying that the source
   data being used for the IO is not being aggressively discarded due to
   page reclaim skipping over dirty pages and reclaiming clean pages. Note
   that the reducion in reads could also be due to inode data not being
   re-read from disk after a slab shrink.
                       3.9.0       3.9.0       3.9.0
                     vanillamm1-mmotm-20130522mm1-lessdisrupt-v7r10
Mean sda-avgqz        166.99       32.09       33.44
Mean sda-await        853.64      192.76      185.43
Mean sda-r_await        6.31        9.24        5.97
Mean sda-w_await     2992.81      202.65      192.43
Max  sda-avgqz       1409.91      718.75      698.98
Max  sda-await       6665.74     3538.00     3124.23
Max  sda-r_await       58.96      111.95       58.00
Max  sda-w_await    28458.94     3977.29     3148.61
In light of the changes in writes from reclaim context, the number of
reads and Dave Chinner's concerns about IO performance I took a closer
look at the IO stats for the test disk. Few observations
1. The average queue size is reduced by the initial series and roughly
   the same with this follow up.
2. Average wait times for writes are reduced and as the IO
   is completing faster it at least implies that the gain is because
   flushers are writing the files efficiently instead of page reclaim
   getting in the way.
3. The reduction in maximum write latency is staggering. 28 seconds down
   to 3 seconds.
Jan Kara asked how NFS is affected by all of this. Unstable pages can
be taken into account as one of the patches in the series shows but it
is still the case that filesystems with unusual handling of dirty or
writeback could still be treated better.
Tests like postmark, fsmark and largedd showed up nothing useful. On my test
setup, pages are simply not being written back from reclaim context with or
without the patches and there are no changes in performance. My test setup
probably is just not strong enough network-wise to be really interesting.
I ran a longer-lived memcached test with IO going to NFS instead of a local disk
parallelio
                                             3.9.0                       3.9.0                       3.9.0
                                           vanilla          mm1-mmotm-20130522       mm1-lessdisrupt-v7r10
Ops memcachetest-0M             23323.00 (  0.00%)          23241.00 ( -0.35%)          23321.00 ( -0.01%)
Ops memcachetest-715M           25526.00 (  0.00%)          24763.00 ( -2.99%)          23242.00 ( -8.95%)
Ops memcachetest-2385M           8814.00 (  0.00%)          26924.00 (205.47%)          23521.00 (166.86%)
Ops memcachetest-4055M           5835.00 (  0.00%)          26827.00 (359.76%)          25560.00 (338.05%)
Ops io-duration-0M                  0.00 (  0.00%)              0.00 (  0.00%)              0.00 (  0.00%)
Ops io-duration-715M               65.00 (  0.00%)             71.00 ( -9.23%)             11.00 ( 83.08%)
Ops io-duration-2385M             129.00 (  0.00%)             94.00 ( 27.13%)             53.00 ( 58.91%)
Ops io-duration-4055M             301.00 (  0.00%)            100.00 ( 66.78%)            108.00 ( 64.12%)
Ops swaptotal-0M                    0.00 (  0.00%)              0.00 (  0.00%)              0.00 (  0.00%)
Ops swaptotal-715M              14394.00 (  0.00%)            949.00 ( 93.41%)             63.00 ( 99.56%)
Ops swaptotal-2385M            401483.00 (  0.00%)          24437.00 ( 93.91%)          30118.00 ( 92.50%)
Ops swaptotal-4055M            554123.00 (  0.00%)          35688.00 ( 93.56%)          63082.00 ( 88.62%)
Ops swapin-0M                       0.00 (  0.00%)              0.00 (  0.00%)              0.00 (  0.00%)
Ops swapin-715M                  4522.00 (  0.00%)            560.00 ( 87.62%)             63.00 ( 98.61%)
Ops swapin-2385M               169861.00 (  0.00%)           5026.00 ( 97.04%)          13917.00 ( 91.81%)
Ops swapin-4055M               192374.00 (  0.00%)          10056.00 ( 94.77%)          25729.00 ( 86.63%)
Ops minorfaults-0M            1445969.00 (  0.00%)        1520878.00 ( -5.18%)        1454024.00 ( -0.56%)
Ops minorfaults-715M          1557288.00 (  0.00%)        1528482.00 (  1.85%)        1535776.00 (  1.38%)
Ops minorfaults-2385M         1692896.00 (  0.00%)        1570523.00 (  7.23%)        1559622.00 (  7.87%)
Ops minorfaults-4055M         1654985.00 (  0.00%)        1581456.00 (  4.44%)        1596713.00 (  3.52%)
Ops majorfaults-0M                  0.00 (  0.00%)              1.00 (-99.00%)              0.00 (  0.00%)
Ops majorfaults-715M              763.00 (  0.00%)            265.00 ( 65.27%)             75.00 ( 90.17%)
Ops majorfaults-2385M           23861.00 (  0.00%)            894.00 ( 96.25%)           2189.00 ( 90.83%)
Ops majorfaults-4055M           27210.00 (  0.00%)           1569.00 ( 94.23%)           4088.00 ( 84.98%)
1. Performance does not collapse due to IO which is good. IO is also completing
   faster. Note with mmotm, IO completes in a third of the time and faster again
   with this series applied
2. Swapping is reduced, although not eliminated. The figures for the follow-up
   look bad but it does vary a bit as the stalling is not perfect for nfs
   or filesystems like ext3 with unusual handling of dirty and writeback
   pages
3. There are swapins, particularly with larger amounts of IO indicating
   that active pages are being reclaimed. However, the number of much
   reduced.
                                 3.9.0       3.9.0       3.9.0
                               vanillamm1-mmotm-20130522mm1-lessdisrupt-v7r10
Minor Faults                  36339175    35025445    35219699
Major Faults                    310964       27108       51887
Swap Ins                       2176399      173069      333316
Swap Outs                      3344050      357228      504824
Direct pages scanned              8972       77283       43242
Kswapd pages scanned          20899983     8939566    14772851
Kswapd pages reclaimed         6193156     5172605     5231026
Direct pages reclaimed            8450       73802       39514
Kswapd efficiency                  29%         57%         35%
Kswapd velocity               3929.743    1847.499    3058.840
Direct efficiency                  94%         95%         91%
Direct velocity                  1.687      15.972       8.954
Percentage direct scans             0%          0%          0%
Zone normal velocity          3721.907     939.103    2185.142
Zone dma32 velocity            209.522     924.368     882.651
Zone dma velocity                0.000       0.000       0.000
Page writes by reclaim     4082185.000  526319.000  537114.000
Page writes file                738135      169091       32290
Page writes anon               3344050      357228      504824
Page reclaim immediate            9524         170     5595843
Sector Reads                   8909900      861192     1483680
Sector Writes                 13428980     1488744     2076800
Page rescued immediate               0           0           0
Slabs scanned                    38016       31744       28672
Direct inode steals                  0           0           0
Kswapd inode steals                424           0           0
Kswapd skipped wait                  0           0           0
THP fault alloc                     14          15         119
THP collapse alloc                1767        1569        1618
THP splits                          30          29          25
THP fault fallback                   0           0           0
THP collapse fail                    8           5           0
Compaction stalls                   17          41         100
Compaction success                   7          31          95
Compaction failures                 10          10           5
Page migrate success              7083       22157       62217
Page migrate failure                 0           0           0
Compaction pages isolated        14847       48758      135830
Compaction migrate scanned       18328       48398      138929
Compaction free scanned        2000255      355827     1720269
Compaction cost                      7          24          68
I guess the main takeaway again is the much reduced page writes
from reclaim context and reduced reads.
                       3.9.0       3.9.0       3.9.0
                     vanillamm1-mmotm-20130522mm1-lessdisrupt-v7r10
Mean sda-avgqz         23.58        0.35        0.44
Mean sda-await        133.47       15.72       15.46
Mean sda-r_await        4.72        4.69        3.95
Mean sda-w_await      507.69       28.40       33.68
Max  sda-avgqz        680.60       12.25       23.14
Max  sda-await       3958.89      221.83      286.22
Max  sda-r_await       63.86       61.23       67.29
Max  sda-w_await    11710.38      883.57     1767.28
And as before, write wait times are much reduced.
This patch:
The patch "mm: vmscan: Have kswapd writeback pages based on dirty pages
encountered, not priority" decides whether to writeback pages from reclaim
context based on the number of dirty pages encountered.  This situation is
flagged too easily and flushers are not given the chance to catch up
resulting in more pages being written from reclaim context and potentially
impacting IO performance.  The check for PageWriteback is also misplaced
as it happens within a PageDirty check which is nonsense as the dirty may
have been cleared for IO.  The accounting is updated very late and pages
that are already under writeback, were reactivated, could not unmapped or
could not be released are all missed.  Similarly, a page is considered
congested for reasons other than being congested and pages that cannot be
written out in the correct context are skipped.  Finally, it considers
stalling and writing back filesystem pages due to encountering dirty
anonymous pages at the tail of the LRU which is dumb.
This patch causes kswapd to begin writing filesystem pages from reclaim
context only if page reclaim found that all filesystem pages at the tail
of the LRU were unqueued dirty pages.  Before it starts writing filesystem
pages, it will stall to give flushers a chance to catch up.  The decision
on whether wait_iff_congested is also now determined by dirty filesystem
pages only.  Congested pages are based on whether the underlying BDI is
congested regardless of the context of the reclaiming process.
Signed-off-by: Mel Gorman <mgorman@suse.de>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Michal Hocko <mhocko@suse.cz>
Cc: Rik van Riel <riel@redhat.com>
Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Jiri Slaby <jslaby@suse.cz>
Cc: Valdis Kletnieks <Valdis.Kletnieks@vt.edu>
Cc: Zlatko Calusic <zcalusic@bitsync.net>
Cc: dormando <dormando@rydia.net>
Cc: Trond Myklebust <trond.myklebust@fys.uio.no>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
											
										 
											2013-07-03 15:01:57 -07:00
										 |  |  | 		mapping = page_mapping(page); | 
					
						
							| 
									
										
										
										
											2013-07-03 15:02:03 -07:00
										 |  |  | 		if ((mapping && bdi_write_congested(mapping->backing_dev_info)) || | 
					
						
							|  |  |  | 		    (writeback && PageReclaim(page))) | 
					
						
							| 
									
										
											  
											
												mm: vmscan: stall page reclaim and writeback pages based on dirty/writepage pages encountered
Further testing of the "Reduce system disruption due to kswapd"
discovered a few problems.  First and foremost, it's possible for pages
under writeback to be freed which will lead to badness.  Second, as
pages were not being swapped the file LRU was being scanned faster and
clean file pages were being reclaimed.  In some cases this results in
increased read IO to re-read data from disk.  Third, more pages were
being written from kswapd context which can adversly affect IO
performance.  Lastly, it was observed that PageDirty pages are not
necessarily dirty on all filesystems (buffers can be clean while
PageDirty is set and ->writepage generates no IO) and not all
filesystems set PageWriteback when the page is being written (e.g.
ext3).  This disconnect confuses the reclaim stalling logic.  This
follow-up series is aimed at these problems.
The tests were based on three kernels
vanilla:	kernel 3.9 as that is what the current mmotm uses as a baseline
mmotm-20130522	is mmotm as of 22nd May with "Reduce system disruption due to
		kswapd" applied on top as per what should be in Andrew's tree
		right now
lessdisrupt-v7r10 is this follow-up series on top of the mmotm kernel
The first test used memcached+memcachetest while some background IO was
in progress as implemented by the parallel IO tests implement in MM
Tests.  memcachetest benchmarks how many operations/second memcached can
service.  It starts with no background IO on a freshly created ext4
filesystem and then re-runs the test with larger amounts of IO in the
background to roughly simulate a large copy in progress.  The
expectation is that the IO should have little or no impact on
memcachetest which is running entirely in memory.
parallelio
                                             3.9.0                       3.9.0                       3.9.0
                                           vanilla          mm1-mmotm-20130522       mm1-lessdisrupt-v7r10
Ops memcachetest-0M             23117.00 (  0.00%)          22780.00 ( -1.46%)          22763.00 ( -1.53%)
Ops memcachetest-715M           23774.00 (  0.00%)          23299.00 ( -2.00%)          22934.00 ( -3.53%)
Ops memcachetest-2385M           4208.00 (  0.00%)          24154.00 (474.00%)          23765.00 (464.76%)
Ops memcachetest-4055M           4104.00 (  0.00%)          25130.00 (512.33%)          24614.00 (499.76%)
Ops io-duration-0M                  0.00 (  0.00%)              0.00 (  0.00%)              0.00 (  0.00%)
Ops io-duration-715M               12.00 (  0.00%)              7.00 ( 41.67%)              6.00 ( 50.00%)
Ops io-duration-2385M             116.00 (  0.00%)             21.00 ( 81.90%)             21.00 ( 81.90%)
Ops io-duration-4055M             160.00 (  0.00%)             36.00 ( 77.50%)             35.00 ( 78.12%)
Ops swaptotal-0M                    0.00 (  0.00%)              0.00 (  0.00%)              0.00 (  0.00%)
Ops swaptotal-715M             140138.00 (  0.00%)             18.00 ( 99.99%)             18.00 ( 99.99%)
Ops swaptotal-2385M            385682.00 (  0.00%)              0.00 (  0.00%)              0.00 (  0.00%)
Ops swaptotal-4055M            418029.00 (  0.00%)              0.00 (  0.00%)              0.00 (  0.00%)
Ops swapin-0M                       0.00 (  0.00%)              0.00 (  0.00%)              0.00 (  0.00%)
Ops swapin-715M                   144.00 (  0.00%)              0.00 (  0.00%)              0.00 (  0.00%)
Ops swapin-2385M               134227.00 (  0.00%)              0.00 (  0.00%)              0.00 (  0.00%)
Ops swapin-4055M               125618.00 (  0.00%)              0.00 (  0.00%)              0.00 (  0.00%)
Ops minorfaults-0M            1536429.00 (  0.00%)        1531632.00 (  0.31%)        1533541.00 (  0.19%)
Ops minorfaults-715M          1786996.00 (  0.00%)        1612148.00 (  9.78%)        1608832.00 (  9.97%)
Ops minorfaults-2385M         1757952.00 (  0.00%)        1614874.00 (  8.14%)        1613541.00 (  8.21%)
Ops minorfaults-4055M         1774460.00 (  0.00%)        1633400.00 (  7.95%)        1630881.00 (  8.09%)
Ops majorfaults-0M                  1.00 (  0.00%)              0.00 (  0.00%)              0.00 (  0.00%)
Ops majorfaults-715M              184.00 (  0.00%)            167.00 (  9.24%)            166.00 (  9.78%)
Ops majorfaults-2385M           24444.00 (  0.00%)            155.00 ( 99.37%)             93.00 ( 99.62%)
Ops majorfaults-4055M           21357.00 (  0.00%)            147.00 ( 99.31%)            134.00 ( 99.37%)
memcachetest is the transactions/second reported by memcachetest. In
        the vanilla kernel note that performance drops from around
        23K/sec to just over 4K/second when there is 2385M of IO going
        on in the background. With current mmotm, there is no collapse
	in performance and with this follow-up series there is little
	change.
swaptotal is the total amount of swap traffic. With mmotm and the follow-up
	series, the total amount of swapping is much reduced.
                                 3.9.0       3.9.0       3.9.0
                               vanillamm1-mmotm-20130522mm1-lessdisrupt-v7r10
Minor Faults                  11160152    10706748    10622316
Major Faults                     46305         755         678
Swap Ins                        260249           0           0
Swap Outs                       683860          18          18
Direct pages scanned                 0         678        2520
Kswapd pages scanned           6046108     8814900     1639279
Kswapd pages reclaimed         1081954     1172267     1094635
Direct pages reclaimed               0         566        2304
Kswapd efficiency                  17%         13%         66%
Kswapd velocity               5217.560    7618.953    1414.879
Direct efficiency                 100%         83%         91%
Direct velocity                  0.000       0.586       2.175
Percentage direct scans             0%          0%          0%
Zone normal velocity          5105.086    6824.681     671.158
Zone dma32 velocity            112.473     794.858     745.896
Zone dma velocity                0.000       0.000       0.000
Page writes by reclaim     1929612.000 6861768.000   32821.000
Page writes file               1245752     6861750       32803
Page writes anon                683860          18          18
Page reclaim immediate            7484          40         239
Sector Reads                   1130320       93996       86900
Sector Writes                 13508052    10823500    11804436
Page rescued immediate               0           0           0
Slabs scanned                    33536       27136       18560
Direct inode steals                  0           0           0
Kswapd inode steals               8641        1035           0
Kswapd skipped wait                  0           0           0
THP fault alloc                      8          37          33
THP collapse alloc                 508         552         515
THP splits                          24           1           1
THP fault fallback                   0           0           0
THP collapse fail                    0           0           0
There are a number of observations to make here
1. Swap outs are almost eliminated. Swap ins are 0 indicating that the
   pages swapped were really unused anonymous pages. Related to that,
   major faults are much reduced.
2. kswapd efficiency was impacted by the initial series but with these
   follow-up patches, the efficiency is now at 66% indicating that far
   fewer pages were skipped during scanning due to dirty or writeback
   pages.
3. kswapd velocity is reduced indicating that fewer pages are being scanned
   with the follow-up series as kswapd now stalls when the tail of the
   LRU queue is full of unqueued dirty pages. The stall gives flushers a
   chance to catch-up so kswapd can reclaim clean pages when it wakes
4. In light of Zlatko's recent reports about zone scanning imbalances,
   mmtests now reports scanning velocity on a per-zone basis. With mainline,
   you can see that the scanning activity is dominated by the Normal
   zone with over 45 times more scanning in Normal than the DMA32 zone.
   With the series currently in mmotm, the ratio is slightly better but it
   is still the case that the bulk of scanning is in the highest zone. With
   this follow-up series, the ratio of scanning between the Normal and
   DMA32 zone is roughly equal.
5. As Dave Chinner observed, the current patches in mmotm increased the
   number of pages written from kswapd context which is expected to adversly
   impact IO performance. With the follow-up patches, far fewer pages are
   written from kswapd context than the mainline kernel
6. With the series in mmotm, fewer inodes were reclaimed by kswapd. With
   the follow-up series, there is less slab shrinking activity and no inodes
   were reclaimed.
7. Note that "Sectors Read" is drastically reduced implying that the source
   data being used for the IO is not being aggressively discarded due to
   page reclaim skipping over dirty pages and reclaiming clean pages. Note
   that the reducion in reads could also be due to inode data not being
   re-read from disk after a slab shrink.
                       3.9.0       3.9.0       3.9.0
                     vanillamm1-mmotm-20130522mm1-lessdisrupt-v7r10
Mean sda-avgqz        166.99       32.09       33.44
Mean sda-await        853.64      192.76      185.43
Mean sda-r_await        6.31        9.24        5.97
Mean sda-w_await     2992.81      202.65      192.43
Max  sda-avgqz       1409.91      718.75      698.98
Max  sda-await       6665.74     3538.00     3124.23
Max  sda-r_await       58.96      111.95       58.00
Max  sda-w_await    28458.94     3977.29     3148.61
In light of the changes in writes from reclaim context, the number of
reads and Dave Chinner's concerns about IO performance I took a closer
look at the IO stats for the test disk. Few observations
1. The average queue size is reduced by the initial series and roughly
   the same with this follow up.
2. Average wait times for writes are reduced and as the IO
   is completing faster it at least implies that the gain is because
   flushers are writing the files efficiently instead of page reclaim
   getting in the way.
3. The reduction in maximum write latency is staggering. 28 seconds down
   to 3 seconds.
Jan Kara asked how NFS is affected by all of this. Unstable pages can
be taken into account as one of the patches in the series shows but it
is still the case that filesystems with unusual handling of dirty or
writeback could still be treated better.
Tests like postmark, fsmark and largedd showed up nothing useful. On my test
setup, pages are simply not being written back from reclaim context with or
without the patches and there are no changes in performance. My test setup
probably is just not strong enough network-wise to be really interesting.
I ran a longer-lived memcached test with IO going to NFS instead of a local disk
parallelio
                                             3.9.0                       3.9.0                       3.9.0
                                           vanilla          mm1-mmotm-20130522       mm1-lessdisrupt-v7r10
Ops memcachetest-0M             23323.00 (  0.00%)          23241.00 ( -0.35%)          23321.00 ( -0.01%)
Ops memcachetest-715M           25526.00 (  0.00%)          24763.00 ( -2.99%)          23242.00 ( -8.95%)
Ops memcachetest-2385M           8814.00 (  0.00%)          26924.00 (205.47%)          23521.00 (166.86%)
Ops memcachetest-4055M           5835.00 (  0.00%)          26827.00 (359.76%)          25560.00 (338.05%)
Ops io-duration-0M                  0.00 (  0.00%)              0.00 (  0.00%)              0.00 (  0.00%)
Ops io-duration-715M               65.00 (  0.00%)             71.00 ( -9.23%)             11.00 ( 83.08%)
Ops io-duration-2385M             129.00 (  0.00%)             94.00 ( 27.13%)             53.00 ( 58.91%)
Ops io-duration-4055M             301.00 (  0.00%)            100.00 ( 66.78%)            108.00 ( 64.12%)
Ops swaptotal-0M                    0.00 (  0.00%)              0.00 (  0.00%)              0.00 (  0.00%)
Ops swaptotal-715M              14394.00 (  0.00%)            949.00 ( 93.41%)             63.00 ( 99.56%)
Ops swaptotal-2385M            401483.00 (  0.00%)          24437.00 ( 93.91%)          30118.00 ( 92.50%)
Ops swaptotal-4055M            554123.00 (  0.00%)          35688.00 ( 93.56%)          63082.00 ( 88.62%)
Ops swapin-0M                       0.00 (  0.00%)              0.00 (  0.00%)              0.00 (  0.00%)
Ops swapin-715M                  4522.00 (  0.00%)            560.00 ( 87.62%)             63.00 ( 98.61%)
Ops swapin-2385M               169861.00 (  0.00%)           5026.00 ( 97.04%)          13917.00 ( 91.81%)
Ops swapin-4055M               192374.00 (  0.00%)          10056.00 ( 94.77%)          25729.00 ( 86.63%)
Ops minorfaults-0M            1445969.00 (  0.00%)        1520878.00 ( -5.18%)        1454024.00 ( -0.56%)
Ops minorfaults-715M          1557288.00 (  0.00%)        1528482.00 (  1.85%)        1535776.00 (  1.38%)
Ops minorfaults-2385M         1692896.00 (  0.00%)        1570523.00 (  7.23%)        1559622.00 (  7.87%)
Ops minorfaults-4055M         1654985.00 (  0.00%)        1581456.00 (  4.44%)        1596713.00 (  3.52%)
Ops majorfaults-0M                  0.00 (  0.00%)              1.00 (-99.00%)              0.00 (  0.00%)
Ops majorfaults-715M              763.00 (  0.00%)            265.00 ( 65.27%)             75.00 ( 90.17%)
Ops majorfaults-2385M           23861.00 (  0.00%)            894.00 ( 96.25%)           2189.00 ( 90.83%)
Ops majorfaults-4055M           27210.00 (  0.00%)           1569.00 ( 94.23%)           4088.00 ( 84.98%)
1. Performance does not collapse due to IO which is good. IO is also completing
   faster. Note with mmotm, IO completes in a third of the time and faster again
   with this series applied
2. Swapping is reduced, although not eliminated. The figures for the follow-up
   look bad but it does vary a bit as the stalling is not perfect for nfs
   or filesystems like ext3 with unusual handling of dirty and writeback
   pages
3. There are swapins, particularly with larger amounts of IO indicating
   that active pages are being reclaimed. However, the number of much
   reduced.
                                 3.9.0       3.9.0       3.9.0
                               vanillamm1-mmotm-20130522mm1-lessdisrupt-v7r10
Minor Faults                  36339175    35025445    35219699
Major Faults                    310964       27108       51887
Swap Ins                       2176399      173069      333316
Swap Outs                      3344050      357228      504824
Direct pages scanned              8972       77283       43242
Kswapd pages scanned          20899983     8939566    14772851
Kswapd pages reclaimed         6193156     5172605     5231026
Direct pages reclaimed            8450       73802       39514
Kswapd efficiency                  29%         57%         35%
Kswapd velocity               3929.743    1847.499    3058.840
Direct efficiency                  94%         95%         91%
Direct velocity                  1.687      15.972       8.954
Percentage direct scans             0%          0%          0%
Zone normal velocity          3721.907     939.103    2185.142
Zone dma32 velocity            209.522     924.368     882.651
Zone dma velocity                0.000       0.000       0.000
Page writes by reclaim     4082185.000  526319.000  537114.000
Page writes file                738135      169091       32290
Page writes anon               3344050      357228      504824
Page reclaim immediate            9524         170     5595843
Sector Reads                   8909900      861192     1483680
Sector Writes                 13428980     1488744     2076800
Page rescued immediate               0           0           0
Slabs scanned                    38016       31744       28672
Direct inode steals                  0           0           0
Kswapd inode steals                424           0           0
Kswapd skipped wait                  0           0           0
THP fault alloc                     14          15         119
THP collapse alloc                1767        1569        1618
THP splits                          30          29          25
THP fault fallback                   0           0           0
THP collapse fail                    8           5           0
Compaction stalls                   17          41         100
Compaction success                   7          31          95
Compaction failures                 10          10           5
Page migrate success              7083       22157       62217
Page migrate failure                 0           0           0
Compaction pages isolated        14847       48758      135830
Compaction migrate scanned       18328       48398      138929
Compaction free scanned        2000255      355827     1720269
Compaction cost                      7          24          68
I guess the main takeaway again is the much reduced page writes
from reclaim context and reduced reads.
                       3.9.0       3.9.0       3.9.0
                     vanillamm1-mmotm-20130522mm1-lessdisrupt-v7r10
Mean sda-avgqz         23.58        0.35        0.44
Mean sda-await        133.47       15.72       15.46
Mean sda-r_await        4.72        4.69        3.95
Mean sda-w_await      507.69       28.40       33.68
Max  sda-avgqz        680.60       12.25       23.14
Max  sda-await       3958.89      221.83      286.22
Max  sda-r_await       63.86       61.23       67.29
Max  sda-w_await    11710.38      883.57     1767.28
And as before, write wait times are much reduced.
This patch:
The patch "mm: vmscan: Have kswapd writeback pages based on dirty pages
encountered, not priority" decides whether to writeback pages from reclaim
context based on the number of dirty pages encountered.  This situation is
flagged too easily and flushers are not given the chance to catch up
resulting in more pages being written from reclaim context and potentially
impacting IO performance.  The check for PageWriteback is also misplaced
as it happens within a PageDirty check which is nonsense as the dirty may
have been cleared for IO.  The accounting is updated very late and pages
that are already under writeback, were reactivated, could not unmapped or
could not be released are all missed.  Similarly, a page is considered
congested for reasons other than being congested and pages that cannot be
written out in the correct context are skipped.  Finally, it considers
stalling and writing back filesystem pages due to encountering dirty
anonymous pages at the tail of the LRU which is dumb.
This patch causes kswapd to begin writing filesystem pages from reclaim
context only if page reclaim found that all filesystem pages at the tail
of the LRU were unqueued dirty pages.  Before it starts writing filesystem
pages, it will stall to give flushers a chance to catch up.  The decision
on whether wait_iff_congested is also now determined by dirty filesystem
pages only.  Congested pages are based on whether the underlying BDI is
congested regardless of the context of the reclaiming process.
Signed-off-by: Mel Gorman <mgorman@suse.de>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Michal Hocko <mhocko@suse.cz>
Cc: Rik van Riel <riel@redhat.com>
Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Jiri Slaby <jslaby@suse.cz>
Cc: Valdis Kletnieks <Valdis.Kletnieks@vt.edu>
Cc: Zlatko Calusic <zcalusic@bitsync.net>
Cc: dormando <dormando@rydia.net>
Cc: Trond Myklebust <trond.myklebust@fys.uio.no>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
											
										 
											2013-07-03 15:01:57 -07:00
										 |  |  | 			nr_congested++; | 
					
						
							|  |  |  | 
 | 
					
						
							| 
									
										
										
										
											2013-07-03 15:01:51 -07:00
										 |  |  | 		/*
 | 
					
						
							|  |  |  | 		 * If a page at the tail of the LRU is under writeback, there | 
					
						
							|  |  |  | 		 * are three cases to consider. | 
					
						
							|  |  |  | 		 * | 
					
						
							|  |  |  | 		 * 1) If reclaim is encountering an excessive number of pages | 
					
						
							|  |  |  | 		 *    under writeback and this page is both under writeback and | 
					
						
							|  |  |  | 		 *    PageReclaim then it indicates that pages are being queued | 
					
						
							|  |  |  | 		 *    for IO but are being recycled through the LRU before the | 
					
						
							|  |  |  | 		 *    IO can complete. Waiting on the page itself risks an | 
					
						
							|  |  |  | 		 *    indefinite stall if it is impossible to writeback the | 
					
						
							|  |  |  | 		 *    page due to IO error or disconnected storage so instead | 
					
						
							| 
									
										
										
										
											2013-07-03 15:01:58 -07:00
										 |  |  | 		 *    note that the LRU is being scanned too quickly and the | 
					
						
							|  |  |  | 		 *    caller can stall after page list has been processed. | 
					
						
							| 
									
										
										
										
											2013-07-03 15:01:51 -07:00
										 |  |  | 		 * | 
					
						
							|  |  |  | 		 * 2) Global reclaim encounters a page, memcg encounters a | 
					
						
							|  |  |  | 		 *    page that is not marked for immediate reclaim or | 
					
						
							|  |  |  | 		 *    the caller does not have __GFP_IO. In this case mark | 
					
						
							|  |  |  | 		 *    the page for immediate reclaim and continue scanning. | 
					
						
							|  |  |  | 		 * | 
					
						
							|  |  |  | 		 *    __GFP_IO is checked  because a loop driver thread might | 
					
						
							|  |  |  | 		 *    enter reclaim, and deadlock if it waits on a page for | 
					
						
							|  |  |  | 		 *    which it is needed to do the write (loop masks off | 
					
						
							|  |  |  | 		 *    __GFP_IO|__GFP_FS for this reason); but more thought | 
					
						
							|  |  |  | 		 *    would probably show more reasons. | 
					
						
							|  |  |  | 		 * | 
					
						
							|  |  |  | 		 *    Don't require __GFP_FS, since we're not going into the | 
					
						
							|  |  |  | 		 *    FS, just waiting on its writeback completion. Worryingly, | 
					
						
							|  |  |  | 		 *    ext4 gfs2 and xfs allocate pages with | 
					
						
							|  |  |  | 		 *    grab_cache_page_write_begin(,,AOP_FLAG_NOFS), so testing | 
					
						
							|  |  |  | 		 *    may_enter_fs here is liable to OOM on them. | 
					
						
							|  |  |  | 		 * | 
					
						
							|  |  |  | 		 * 3) memcg encounters a page that is not already marked | 
					
						
							|  |  |  | 		 *    PageReclaim. memcg does not have any dirty pages | 
					
						
							|  |  |  | 		 *    throttling so we could easily OOM just because too many | 
					
						
							|  |  |  | 		 *    pages are in writeback and there is nothing else to | 
					
						
							|  |  |  | 		 *    reclaim. Wait for the writeback to complete. | 
					
						
							|  |  |  | 		 */ | 
					
						
							| 
									
										
										
										
											2007-08-22 14:01:26 -07:00
										 |  |  | 		if (PageWriteback(page)) { | 
					
						
							| 
									
										
										
										
											2013-07-03 15:01:51 -07:00
										 |  |  | 			/* Case 1 above */ | 
					
						
							|  |  |  | 			if (current_is_kswapd() && | 
					
						
							|  |  |  | 			    PageReclaim(page) && | 
					
						
							|  |  |  | 			    zone_is_reclaim_writeback(zone)) { | 
					
						
							| 
									
										
										
										
											2013-07-03 15:01:58 -07:00
										 |  |  | 				nr_immediate++; | 
					
						
							|  |  |  | 				goto keep_locked; | 
					
						
							| 
									
										
										
										
											2013-07-03 15:01:51 -07:00
										 |  |  | 
 | 
					
						
							|  |  |  | 			/* Case 2 above */ | 
					
						
							|  |  |  | 			} else if (global_reclaim(sc) || | 
					
						
							| 
									
										
											  
											
												memcg: further prevent OOM with too many dirty pages
The may_enter_fs test turns out to be too restrictive: though I saw no
problem with it when testing on 3.5-rc6, it very soon OOMed when I tested
on 3.5-rc6-mm1.  I don't know what the difference there is, perhaps I just
slightly changed the way I started off the testing: dd if=/dev/zero
of=/mnt/temp bs=1M count=1024; rm -f /mnt/temp; sync repeatedly, in 20M
memory.limit_in_bytes cgroup to ext4 on USB stick.
ext4 (and gfs2 and xfs) turn out to allocate new pages for writing with
AOP_FLAG_NOFS: that seems a little worrying, and it's unclear to me why
the transaction needs to be started even before allocating pagecache
memory.  But it may not be worth worrying about these days: if direct
reclaim avoids FS writeback, does __GFP_FS now mean anything?
Anyway, we insisted on the may_enter_fs test to avoid hangs with the loop
device; but since that also masks off __GFP_IO, we can test for __GFP_IO
directly, ignoring may_enter_fs and __GFP_FS.
But even so, the test still OOMs sometimes: when originally testing on
3.5-rc6, it OOMed about one time in five or ten; when testing just now on
3.5-rc6-mm1, it OOMed on the first iteration.
This residual problem comes from an accumulation of pages under ordinary
writeback, not marked PageReclaim, so rightly not causing the memcg check
to wait on their writeback: these too can prevent shrink_page_list() from
freeing any pages, so many times that memcg reclaim fails and OOMs.
Deal with these in the same way as direct reclaim now deals with dirty FS
pages: mark them PageReclaim.  It is appropriate to rotate these to tail
of list when writepage completes, but more importantly, the PageReclaim
flag makes memcg reclaim wait on them if encountered again.  Increment
NR_VMSCAN_IMMEDIATE?  That's arguable: I chose not.
Setting PageReclaim here may occasionally race with end_page_writeback()
clearing it: lru_deactivate_fn() already faced the same race, and
correctly concluded that the window is small and the issue non-critical.
With these changes, the test runs indefinitely without OOMing on ext4,
ext3 and ext2: I'll move on to test with other filesystems later.
Trivia: invert conditions for a clearer block without an else, and goto
keep_locked to do the unlock_page.
Signed-off-by: Hugh Dickins <hughd@google.com>
Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujtisu.com>
Cc: Minchan Kim <minchan@kernel.org>
Cc: Rik van Riel <riel@redhat.com>
Cc: Ying Han <yinghan@google.com>
Cc: Greg Thelen <gthelen@google.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Fengguang Wu <fengguang.wu@intel.com>
Acked-by: Michal Hocko <mhocko@suse.cz>
Cc: Dave Chinner <david@fromorbit.com>
Cc: Theodore Ts'o <tytso@mit.edu>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
											
										 
											2012-07-31 16:45:59 -07:00
										 |  |  | 			    !PageReclaim(page) || !(sc->gfp_mask & __GFP_IO)) { | 
					
						
							|  |  |  | 				/*
 | 
					
						
							|  |  |  | 				 * This is slightly racy - end_page_writeback() | 
					
						
							|  |  |  | 				 * might have just cleared PageReclaim, then | 
					
						
							|  |  |  | 				 * setting PageReclaim here end up interpreted | 
					
						
							|  |  |  | 				 * as PageReadahead - but that does not matter | 
					
						
							|  |  |  | 				 * enough to care.  What we do want is for this | 
					
						
							|  |  |  | 				 * page to have PageReclaim set next time memcg | 
					
						
							|  |  |  | 				 * reclaim reaches the tests above, so it will | 
					
						
							|  |  |  | 				 * then wait_on_page_writeback() to avoid OOM; | 
					
						
							|  |  |  | 				 * and it's also appropriate in global reclaim. | 
					
						
							|  |  |  | 				 */ | 
					
						
							|  |  |  | 				SetPageReclaim(page); | 
					
						
							| 
									
										
											  
											
												memcg: prevent OOM with too many dirty pages
The current implementation of dirty pages throttling is not memcg aware
which makes it easy to have memcg LRUs full of dirty pages.  Without
throttling, these LRUs can be scanned faster than the rate of writeback,
leading to memcg OOM conditions when the hard limit is small.
This patch fixes the problem by throttling the allocating process
(possibly a writer) during the hard limit reclaim by waiting on
PageReclaim pages.  We are waiting only for PageReclaim pages because
those are the pages that made one full round over LRU and that means that
the writeback is much slower than scanning.
The solution is far from being ideal - long term solution is memcg aware
dirty throttling - but it is meant to be a band aid until we have a real
fix.  We are seeing this happening during nightly backups which are placed
into containers to prevent from eviction of the real working set.
The change affects only memcg reclaim and only when we encounter
PageReclaim pages which is a signal that the reclaim doesn't catch up on
with the writers so somebody should be throttled.  This could be
potentially unfair because it could be somebody else from the group who
gets throttled on behalf of the writer but as writers need to allocate as
well and they allocate in higher rate the probability that only innocent
processes would be penalized is not that high.
I have tested this change by a simple dd copying /dev/zero to tmpfs or
ext3 running under small memcg (1G copy under 5M, 60M, 300M and 2G
containers) and dd got killed by OOM killer every time.  With the patch I
could run the dd with the same size under 5M controller without any OOM.
The issue is more visible with slower devices for output.
* With the patch
================
* tmpfs size=2G
---------------
$ vim cgroup_cache_oom_test.sh
$ ./cgroup_cache_oom_test.sh 5M
using Limit 5M for group
1000+0 records in
1000+0 records out
1048576000 bytes (1.0 GB) copied, 30.4049 s, 34.5 MB/s
$ ./cgroup_cache_oom_test.sh 60M
using Limit 60M for group
1000+0 records in
1000+0 records out
1048576000 bytes (1.0 GB) copied, 31.4561 s, 33.3 MB/s
$ ./cgroup_cache_oom_test.sh 300M
using Limit 300M for group
1000+0 records in
1000+0 records out
1048576000 bytes (1.0 GB) copied, 20.4618 s, 51.2 MB/s
$ ./cgroup_cache_oom_test.sh 2G
using Limit 2G for group
1000+0 records in
1000+0 records out
1048576000 bytes (1.0 GB) copied, 1.42172 s, 738 MB/s
* ext3
------
$ ./cgroup_cache_oom_test.sh 5M
using Limit 5M for group
1000+0 records in
1000+0 records out
1048576000 bytes (1.0 GB) copied, 27.9547 s, 37.5 MB/s
$ ./cgroup_cache_oom_test.sh 60M
using Limit 60M for group
1000+0 records in
1000+0 records out
1048576000 bytes (1.0 GB) copied, 30.3221 s, 34.6 MB/s
$ ./cgroup_cache_oom_test.sh 300M
using Limit 300M for group
1000+0 records in
1000+0 records out
1048576000 bytes (1.0 GB) copied, 24.5764 s, 42.7 MB/s
$ ./cgroup_cache_oom_test.sh 2G
using Limit 2G for group
1000+0 records in
1000+0 records out
1048576000 bytes (1.0 GB) copied, 3.35828 s, 312 MB/s
* Without the patch
===================
* tmpfs size=2G
---------------
$ ./cgroup_cache_oom_test.sh 5M
using Limit 5M for group
./cgroup_cache_oom_test.sh: line 46:  4668 Killed                  dd if=/dev/zero of=$OUT/zero bs=1M count=$count
$ ./cgroup_cache_oom_test.sh 60M
using Limit 60M for group
1000+0 records in
1000+0 records out
1048576000 bytes (1.0 GB) copied, 25.4989 s, 41.1 MB/s
$ ./cgroup_cache_oom_test.sh 300M
using Limit 300M for group
1000+0 records in
1000+0 records out
1048576000 bytes (1.0 GB) copied, 24.3928 s, 43.0 MB/s
$ ./cgroup_cache_oom_test.sh 2G
using Limit 2G for group
1000+0 records in
1000+0 records out
1048576000 bytes (1.0 GB) copied, 1.49797 s, 700 MB/s
* ext3
------
$ ./cgroup_cache_oom_test.sh 5M
using Limit 5M for group
./cgroup_cache_oom_test.sh: line 46:  4689 Killed                  dd if=/dev/zero of=$OUT/zero bs=1M count=$count
$ ./cgroup_cache_oom_test.sh 60M
using Limit 60M for group
./cgroup_cache_oom_test.sh: line 46:  4692 Killed                  dd if=/dev/zero of=$OUT/zero bs=1M count=$count
$ ./cgroup_cache_oom_test.sh 300M
using Limit 300M for group
1000+0 records in
1000+0 records out
1048576000 bytes (1.0 GB) copied, 20.248 s, 51.8 MB/s
$ ./cgroup_cache_oom_test.sh 2G
using Limit 2G for group
1000+0 records in
1000+0 records out
1048576000 bytes (1.0 GB) copied, 2.85201 s, 368 MB/s
[akpm@linux-foundation.org: tweak changelog, reordered the test to optimize for CONFIG_CGROUP_MEM_RES_CTLR=n]
[hughd@google.com: fix deadlock with loop driver]
Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujtisu.com>
Cc: Minchan Kim <minchan@kernel.org>
Cc: Rik van Riel <riel@redhat.com>
Cc: Ying Han <yinghan@google.com>
Cc: Greg Thelen <gthelen@google.com>
Cc: Hugh Dickins <hughd@google.com>
Reviewed-by: Mel Gorman <mgorman@suse.de>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Reviewed-by: Fengguang Wu <fengguang.wu@intel.com>
Signed-off-by: Michal Hocko <mhocko@suse.cz>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
											
										 
											2012-07-31 16:45:55 -07:00
										 |  |  | 				nr_writeback++; | 
					
						
							| 
									
										
										
										
											2013-07-03 15:01:51 -07:00
										 |  |  | 
 | 
					
						
							| 
									
										
											  
											
												memcg: further prevent OOM with too many dirty pages
The may_enter_fs test turns out to be too restrictive: though I saw no
problem with it when testing on 3.5-rc6, it very soon OOMed when I tested
on 3.5-rc6-mm1.  I don't know what the difference there is, perhaps I just
slightly changed the way I started off the testing: dd if=/dev/zero
of=/mnt/temp bs=1M count=1024; rm -f /mnt/temp; sync repeatedly, in 20M
memory.limit_in_bytes cgroup to ext4 on USB stick.
ext4 (and gfs2 and xfs) turn out to allocate new pages for writing with
AOP_FLAG_NOFS: that seems a little worrying, and it's unclear to me why
the transaction needs to be started even before allocating pagecache
memory.  But it may not be worth worrying about these days: if direct
reclaim avoids FS writeback, does __GFP_FS now mean anything?
Anyway, we insisted on the may_enter_fs test to avoid hangs with the loop
device; but since that also masks off __GFP_IO, we can test for __GFP_IO
directly, ignoring may_enter_fs and __GFP_FS.
But even so, the test still OOMs sometimes: when originally testing on
3.5-rc6, it OOMed about one time in five or ten; when testing just now on
3.5-rc6-mm1, it OOMed on the first iteration.
This residual problem comes from an accumulation of pages under ordinary
writeback, not marked PageReclaim, so rightly not causing the memcg check
to wait on their writeback: these too can prevent shrink_page_list() from
freeing any pages, so many times that memcg reclaim fails and OOMs.
Deal with these in the same way as direct reclaim now deals with dirty FS
pages: mark them PageReclaim.  It is appropriate to rotate these to tail
of list when writepage completes, but more importantly, the PageReclaim
flag makes memcg reclaim wait on them if encountered again.  Increment
NR_VMSCAN_IMMEDIATE?  That's arguable: I chose not.
Setting PageReclaim here may occasionally race with end_page_writeback()
clearing it: lru_deactivate_fn() already faced the same race, and
correctly concluded that the window is small and the issue non-critical.
With these changes, the test runs indefinitely without OOMing on ext4,
ext3 and ext2: I'll move on to test with other filesystems later.
Trivia: invert conditions for a clearer block without an else, and goto
keep_locked to do the unlock_page.
Signed-off-by: Hugh Dickins <hughd@google.com>
Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujtisu.com>
Cc: Minchan Kim <minchan@kernel.org>
Cc: Rik van Riel <riel@redhat.com>
Cc: Ying Han <yinghan@google.com>
Cc: Greg Thelen <gthelen@google.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Fengguang Wu <fengguang.wu@intel.com>
Acked-by: Michal Hocko <mhocko@suse.cz>
Cc: Dave Chinner <david@fromorbit.com>
Cc: Theodore Ts'o <tytso@mit.edu>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
											
										 
											2012-07-31 16:45:59 -07:00
										 |  |  | 				goto keep_locked; | 
					
						
							| 
									
										
										
										
											2013-07-03 15:01:51 -07:00
										 |  |  | 
 | 
					
						
							|  |  |  | 			/* Case 3 above */ | 
					
						
							|  |  |  | 			} else { | 
					
						
							|  |  |  | 				wait_on_page_writeback(page); | 
					
						
							| 
									
										
											  
											
												memcg: prevent OOM with too many dirty pages
The current implementation of dirty pages throttling is not memcg aware
which makes it easy to have memcg LRUs full of dirty pages.  Without
throttling, these LRUs can be scanned faster than the rate of writeback,
leading to memcg OOM conditions when the hard limit is small.
This patch fixes the problem by throttling the allocating process
(possibly a writer) during the hard limit reclaim by waiting on
PageReclaim pages.  We are waiting only for PageReclaim pages because
those are the pages that made one full round over LRU and that means that
the writeback is much slower than scanning.
The solution is far from being ideal - long term solution is memcg aware
dirty throttling - but it is meant to be a band aid until we have a real
fix.  We are seeing this happening during nightly backups which are placed
into containers to prevent from eviction of the real working set.
The change affects only memcg reclaim and only when we encounter
PageReclaim pages which is a signal that the reclaim doesn't catch up on
with the writers so somebody should be throttled.  This could be
potentially unfair because it could be somebody else from the group who
gets throttled on behalf of the writer but as writers need to allocate as
well and they allocate in higher rate the probability that only innocent
processes would be penalized is not that high.
I have tested this change by a simple dd copying /dev/zero to tmpfs or
ext3 running under small memcg (1G copy under 5M, 60M, 300M and 2G
containers) and dd got killed by OOM killer every time.  With the patch I
could run the dd with the same size under 5M controller without any OOM.
The issue is more visible with slower devices for output.
* With the patch
================
* tmpfs size=2G
---------------
$ vim cgroup_cache_oom_test.sh
$ ./cgroup_cache_oom_test.sh 5M
using Limit 5M for group
1000+0 records in
1000+0 records out
1048576000 bytes (1.0 GB) copied, 30.4049 s, 34.5 MB/s
$ ./cgroup_cache_oom_test.sh 60M
using Limit 60M for group
1000+0 records in
1000+0 records out
1048576000 bytes (1.0 GB) copied, 31.4561 s, 33.3 MB/s
$ ./cgroup_cache_oom_test.sh 300M
using Limit 300M for group
1000+0 records in
1000+0 records out
1048576000 bytes (1.0 GB) copied, 20.4618 s, 51.2 MB/s
$ ./cgroup_cache_oom_test.sh 2G
using Limit 2G for group
1000+0 records in
1000+0 records out
1048576000 bytes (1.0 GB) copied, 1.42172 s, 738 MB/s
* ext3
------
$ ./cgroup_cache_oom_test.sh 5M
using Limit 5M for group
1000+0 records in
1000+0 records out
1048576000 bytes (1.0 GB) copied, 27.9547 s, 37.5 MB/s
$ ./cgroup_cache_oom_test.sh 60M
using Limit 60M for group
1000+0 records in
1000+0 records out
1048576000 bytes (1.0 GB) copied, 30.3221 s, 34.6 MB/s
$ ./cgroup_cache_oom_test.sh 300M
using Limit 300M for group
1000+0 records in
1000+0 records out
1048576000 bytes (1.0 GB) copied, 24.5764 s, 42.7 MB/s
$ ./cgroup_cache_oom_test.sh 2G
using Limit 2G for group
1000+0 records in
1000+0 records out
1048576000 bytes (1.0 GB) copied, 3.35828 s, 312 MB/s
* Without the patch
===================
* tmpfs size=2G
---------------
$ ./cgroup_cache_oom_test.sh 5M
using Limit 5M for group
./cgroup_cache_oom_test.sh: line 46:  4668 Killed                  dd if=/dev/zero of=$OUT/zero bs=1M count=$count
$ ./cgroup_cache_oom_test.sh 60M
using Limit 60M for group
1000+0 records in
1000+0 records out
1048576000 bytes (1.0 GB) copied, 25.4989 s, 41.1 MB/s
$ ./cgroup_cache_oom_test.sh 300M
using Limit 300M for group
1000+0 records in
1000+0 records out
1048576000 bytes (1.0 GB) copied, 24.3928 s, 43.0 MB/s
$ ./cgroup_cache_oom_test.sh 2G
using Limit 2G for group
1000+0 records in
1000+0 records out
1048576000 bytes (1.0 GB) copied, 1.49797 s, 700 MB/s
* ext3
------
$ ./cgroup_cache_oom_test.sh 5M
using Limit 5M for group
./cgroup_cache_oom_test.sh: line 46:  4689 Killed                  dd if=/dev/zero of=$OUT/zero bs=1M count=$count
$ ./cgroup_cache_oom_test.sh 60M
using Limit 60M for group
./cgroup_cache_oom_test.sh: line 46:  4692 Killed                  dd if=/dev/zero of=$OUT/zero bs=1M count=$count
$ ./cgroup_cache_oom_test.sh 300M
using Limit 300M for group
1000+0 records in
1000+0 records out
1048576000 bytes (1.0 GB) copied, 20.248 s, 51.8 MB/s
$ ./cgroup_cache_oom_test.sh 2G
using Limit 2G for group
1000+0 records in
1000+0 records out
1048576000 bytes (1.0 GB) copied, 2.85201 s, 368 MB/s
[akpm@linux-foundation.org: tweak changelog, reordered the test to optimize for CONFIG_CGROUP_MEM_RES_CTLR=n]
[hughd@google.com: fix deadlock with loop driver]
Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujtisu.com>
Cc: Minchan Kim <minchan@kernel.org>
Cc: Rik van Riel <riel@redhat.com>
Cc: Ying Han <yinghan@google.com>
Cc: Greg Thelen <gthelen@google.com>
Cc: Hugh Dickins <hughd@google.com>
Reviewed-by: Mel Gorman <mgorman@suse.de>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Reviewed-by: Fengguang Wu <fengguang.wu@intel.com>
Signed-off-by: Michal Hocko <mhocko@suse.cz>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
											
										 
											2012-07-31 16:45:55 -07:00
										 |  |  | 			} | 
					
						
							| 
									
										
										
										
											2007-08-22 14:01:26 -07:00
										 |  |  | 		} | 
					
						
							| 
									
										
										
										
											2005-04-16 15:20:36 -07:00
										 |  |  | 
 | 
					
						
							| 
									
										
										
										
											2012-10-08 16:31:55 -07:00
										 |  |  | 		if (!force_reclaim) | 
					
						
							|  |  |  | 			references = page_check_references(page, sc); | 
					
						
							|  |  |  | 
 | 
					
						
							| 
									
										
											  
											
												vmscan: factor out page reference checks
The used-once mapped file page detection patchset.
It is meant to help workloads with large amounts of shortly used file
mappings, like rtorrent hashing a file or git when dealing with loose
objects (git gc on a bigger site?).
Right now, the VM activates referenced mapped file pages on first
encounter on the inactive list and it takes a full memory cycle to
reclaim them again.  When those pages dominate memory, the system
no longer has a meaningful notion of 'working set' and is required
to give up the active list to make reclaim progress.  Obviously,
this results in rather bad scanning latencies and the wrong pages
being reclaimed.
This patch makes the VM be more careful about activating mapped file
pages in the first place.  The minimum granted lifetime without
another memory access becomes an inactive list cycle instead of the
full memory cycle, which is more natural given the mentioned loads.
This test resembles a hashing rtorrent process.  Sequentially, 32MB
chunks of a file are mapped into memory, hashed (sha1) and unmapped
again.  While this happens, every 5 seconds a process is launched and
its execution time taken:
	python2.4 -c 'import pydoc'
	old: max=2.31s mean=1.26s (0.34)
	new: max=1.25s mean=0.32s (0.32)
	find /etc -type f
	old: max=2.52s mean=1.44s (0.43)
	new: max=1.92s mean=0.12s (0.17)
	vim -c ':quit'
	old: max=6.14s mean=4.03s (0.49)
	new: max=3.48s mean=2.41s (0.25)
	mplayer --help
	old: max=8.08s mean=5.74s (1.02)
	new: max=3.79s mean=1.32s (0.81)
	overall hash time (stdev):
	old: time=1192.30 (12.85) thruput=25.78mb/s (0.27)
	new: time=1060.27 (32.58) thruput=29.02mb/s (0.88) (-11%)
I also tested kernbench with regular IO streaming in the background to
see whether the delayed activation of frequently used mapped file
pages had a negative impact on performance in the presence of pressure
on the inactive list.  The patch made no significant difference in
timing, neither for kernbench nor for the streaming IO throughput.
The first patch submission raised concerns about the cost of the extra
faults for actually activated pages on machines that have no hardware
support for young page table entries.
I created an artificial worst case scenario on an ARM machine with
around 300MHz and 64MB of memory to figure out the dimensions
involved.  The test would mmap a file of 20MB, then
  1. touch all its pages to fault them in
  2. force one full scan cycle on the inactive file LRU
  -- old: mapping pages activated
  -- new: mapping pages inactive
  3. touch the mapping pages again
  -- old and new: fault exceptions to set the young bits
  4. force another full scan cycle on the inactive file LRU
  5. touch the mapping pages one last time
  -- new: fault exceptions to set the young bits
The test showed an overall increase of 6% in time over 100 iterations
of the above (old: ~212sec, new: ~225sec).  13 secs total overhead /
(100 * 5k pages), ignoring the execution time of the test itself,
makes for about 25us overhead for every page that gets actually
activated.  Note:
  1. File mapping the size of one third of main memory, _completely_
  in active use across memory pressure - i.e., most pages referenced
  within one LRU cycle.  This should be rare to non-existant,
  especially on such embedded setups.
  2. Many huge activation batches.  Those batches only occur when the
  working set fluctuates.  If it changes completely between every full
  LRU cycle, you have problematic reclaim overhead anyway.
  3. Access of activated pages at maximum speed: sequential loads from
  every single page without doing anything in between.  In reality,
  the extra faults will get distributed between actual operations on
  the data.
So even if a workload manages to get the VM into the situation of
activating a third of memory in one go on such a setup, it will take
2.2 seconds instead 2.1 without the patch.
Comparing the numbers (and my user-experience over several months),
I think this change is an overall improvement to the VM.
Patch 1 is only refactoring to break up that ugly compound conditional
in shrink_page_list() and make it easy to document and add new checks
in a readable fashion.
Patch 2 gets rid of the obsolete page_mapping_inuse().  It's not
strictly related to #3, but it was in the original submission and is a
net simplification, so I kept it.
Patch 3 implements used-once detection of mapped file pages.
This patch:
Moving the big conditional into its own predicate function makes the code
a bit easier to read and allows for better commenting on the checks
one-by-one.
This is just cleaning up, no semantics should have been changed.
Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
Reviewed-by: Rik van Riel <riel@redhat.com>
Cc: Minchan Kim <minchan.kim@gmail.com>
Cc: OSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Cc: Lee Schermerhorn <lee.schermerhorn@hp.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
											
										 
											2010-03-05 13:42:19 -08:00
										 |  |  | 		switch (references) { | 
					
						
							|  |  |  | 		case PAGEREF_ACTIVATE: | 
					
						
							| 
									
										
										
										
											2005-04-16 15:20:36 -07:00
										 |  |  | 			goto activate_locked; | 
					
						
							| 
									
										
										
										
											2010-03-05 13:42:22 -08:00
										 |  |  | 		case PAGEREF_KEEP: | 
					
						
							|  |  |  | 			goto keep_locked; | 
					
						
							| 
									
										
											  
											
												vmscan: factor out page reference checks
The used-once mapped file page detection patchset.
It is meant to help workloads with large amounts of shortly used file
mappings, like rtorrent hashing a file or git when dealing with loose
objects (git gc on a bigger site?).
Right now, the VM activates referenced mapped file pages on first
encounter on the inactive list and it takes a full memory cycle to
reclaim them again.  When those pages dominate memory, the system
no longer has a meaningful notion of 'working set' and is required
to give up the active list to make reclaim progress.  Obviously,
this results in rather bad scanning latencies and the wrong pages
being reclaimed.
This patch makes the VM be more careful about activating mapped file
pages in the first place.  The minimum granted lifetime without
another memory access becomes an inactive list cycle instead of the
full memory cycle, which is more natural given the mentioned loads.
This test resembles a hashing rtorrent process.  Sequentially, 32MB
chunks of a file are mapped into memory, hashed (sha1) and unmapped
again.  While this happens, every 5 seconds a process is launched and
its execution time taken:
	python2.4 -c 'import pydoc'
	old: max=2.31s mean=1.26s (0.34)
	new: max=1.25s mean=0.32s (0.32)
	find /etc -type f
	old: max=2.52s mean=1.44s (0.43)
	new: max=1.92s mean=0.12s (0.17)
	vim -c ':quit'
	old: max=6.14s mean=4.03s (0.49)
	new: max=3.48s mean=2.41s (0.25)
	mplayer --help
	old: max=8.08s mean=5.74s (1.02)
	new: max=3.79s mean=1.32s (0.81)
	overall hash time (stdev):
	old: time=1192.30 (12.85) thruput=25.78mb/s (0.27)
	new: time=1060.27 (32.58) thruput=29.02mb/s (0.88) (-11%)
I also tested kernbench with regular IO streaming in the background to
see whether the delayed activation of frequently used mapped file
pages had a negative impact on performance in the presence of pressure
on the inactive list.  The patch made no significant difference in
timing, neither for kernbench nor for the streaming IO throughput.
The first patch submission raised concerns about the cost of the extra
faults for actually activated pages on machines that have no hardware
support for young page table entries.
I created an artificial worst case scenario on an ARM machine with
around 300MHz and 64MB of memory to figure out the dimensions
involved.  The test would mmap a file of 20MB, then
  1. touch all its pages to fault them in
  2. force one full scan cycle on the inactive file LRU
  -- old: mapping pages activated
  -- new: mapping pages inactive
  3. touch the mapping pages again
  -- old and new: fault exceptions to set the young bits
  4. force another full scan cycle on the inactive file LRU
  5. touch the mapping pages one last time
  -- new: fault exceptions to set the young bits
The test showed an overall increase of 6% in time over 100 iterations
of the above (old: ~212sec, new: ~225sec).  13 secs total overhead /
(100 * 5k pages), ignoring the execution time of the test itself,
makes for about 25us overhead for every page that gets actually
activated.  Note:
  1. File mapping the size of one third of main memory, _completely_
  in active use across memory pressure - i.e., most pages referenced
  within one LRU cycle.  This should be rare to non-existant,
  especially on such embedded setups.
  2. Many huge activation batches.  Those batches only occur when the
  working set fluctuates.  If it changes completely between every full
  LRU cycle, you have problematic reclaim overhead anyway.
  3. Access of activated pages at maximum speed: sequential loads from
  every single page without doing anything in between.  In reality,
  the extra faults will get distributed between actual operations on
  the data.
So even if a workload manages to get the VM into the situation of
activating a third of memory in one go on such a setup, it will take
2.2 seconds instead 2.1 without the patch.
Comparing the numbers (and my user-experience over several months),
I think this change is an overall improvement to the VM.
Patch 1 is only refactoring to break up that ugly compound conditional
in shrink_page_list() and make it easy to document and add new checks
in a readable fashion.
Patch 2 gets rid of the obsolete page_mapping_inuse().  It's not
strictly related to #3, but it was in the original submission and is a
net simplification, so I kept it.
Patch 3 implements used-once detection of mapped file pages.
This patch:
Moving the big conditional into its own predicate function makes the code
a bit easier to read and allows for better commenting on the checks
one-by-one.
This is just cleaning up, no semantics should have been changed.
Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
Reviewed-by: Rik van Riel <riel@redhat.com>
Cc: Minchan Kim <minchan.kim@gmail.com>
Cc: OSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Cc: Lee Schermerhorn <lee.schermerhorn@hp.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
											
										 
											2010-03-05 13:42:19 -08:00
										 |  |  | 		case PAGEREF_RECLAIM: | 
					
						
							|  |  |  | 		case PAGEREF_RECLAIM_CLEAN: | 
					
						
							|  |  |  | 			; /* try to reclaim the page below */ | 
					
						
							|  |  |  | 		} | 
					
						
							| 
									
										
										
										
											2005-04-16 15:20:36 -07:00
										 |  |  | 
 | 
					
						
							|  |  |  | 		/*
 | 
					
						
							|  |  |  | 		 * Anonymous process memory has backing store? | 
					
						
							|  |  |  | 		 * Try to allocate it some swap space here. | 
					
						
							|  |  |  | 		 */ | 
					
						
							| 
									
										
											  
											
												mlock: mlocked pages are unevictable
Make sure that mlocked pages also live on the unevictable LRU, so kswapd
will not scan them over and over again.
This is achieved through various strategies:
1) add yet another page flag--PG_mlocked--to indicate that
   the page is locked for efficient testing in vmscan and,
   optionally, fault path.  This allows early culling of
   unevictable pages, preventing them from getting to
   page_referenced()/try_to_unmap().  Also allows separate
   accounting of mlock'd pages, as Nick's original patch
   did.
   Note:  Nick's original mlock patch used a PG_mlocked
   flag.  I had removed this in favor of the PG_unevictable
   flag + an mlock_count [new page struct member].  I
   restored the PG_mlocked flag to eliminate the new
   count field.
2) add the mlock/unevictable infrastructure to mm/mlock.c,
   with internal APIs in mm/internal.h.  This is a rework
   of Nick's original patch to these files, taking into
   account that mlocked pages are now kept on unevictable
   LRU list.
3) update vmscan.c:page_evictable() to check PageMlocked()
   and, if vma passed in, the vm_flags.  Note that the vma
   will only be passed in for new pages in the fault path;
   and then only if the "cull unevictable pages in fault
   path" patch is included.
4) add try_to_unlock() to rmap.c to walk a page's rmap and
   ClearPageMlocked() if no other vmas have it mlocked.
   Reuses as much of try_to_unmap() as possible.  This
   effectively replaces the use of one of the lru list links
   as an mlock count.  If this mechanism let's pages in mlocked
   vmas leak through w/o PG_mlocked set [I don't know that it
   does], we should catch them later in try_to_unmap().  One
   hopes this will be rare, as it will be relatively expensive.
Original mm/internal.h, mm/rmap.c and mm/mlock.c changes:
Signed-off-by: Nick Piggin <npiggin@suse.de>
splitlru: introduce __get_user_pages():
  New munlock processing need to GUP_FLAGS_IGNORE_VMA_PERMISSIONS.
  because current get_user_pages() can't grab PROT_NONE pages theresore it
  cause PROT_NONE pages can't munlock.
[akpm@linux-foundation.org: fix this for pagemap-pass-mm-into-pagewalkers.patch]
[akpm@linux-foundation.org: untangle patch interdependencies]
[akpm@linux-foundation.org: fix things after out-of-order merging]
[hugh@veritas.com: fix page-flags mess]
[lee.schermerhorn@hp.com: fix munlock page table walk - now requires 'mm']
[kosaki.motohiro@jp.fujitsu.com: build fix]
[kosaki.motohiro@jp.fujitsu.com: fix truncate race and sevaral comments]
[kosaki.motohiro@jp.fujitsu.com: splitlru: introduce __get_user_pages()]
Signed-off-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Signed-off-by: Rik van Riel <riel@redhat.com>
Signed-off-by: Lee Schermerhorn <lee.schermerhorn@hp.com>
Cc: Nick Piggin <npiggin@suse.de>
Cc: Dave Hansen <dave@linux.vnet.ibm.com>
Cc: Matt Mackall <mpm@selenic.com>
Signed-off-by: Hugh Dickins <hugh@veritas.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
											
										 
											2008-10-18 20:26:44 -07:00
										 |  |  | 		if (PageAnon(page) && !PageSwapCache(page)) { | 
					
						
							| 
									
										
										
										
											2008-11-19 15:36:37 -08:00
										 |  |  | 			if (!(sc->gfp_mask & __GFP_IO)) | 
					
						
							|  |  |  | 				goto keep_locked; | 
					
						
							| 
									
										
										
										
											2013-04-29 15:08:36 -07:00
										 |  |  | 			if (!add_to_swap(page, page_list)) | 
					
						
							| 
									
										
										
										
											2005-04-16 15:20:36 -07:00
										 |  |  | 				goto activate_locked; | 
					
						
							| 
									
										
										
										
											2008-11-19 15:36:37 -08:00
										 |  |  | 			may_enter_fs = 1; | 
					
						
							| 
									
										
										
										
											2005-04-16 15:20:36 -07:00
										 |  |  | 
 | 
					
						
							| 
									
										
											  
											
												mm: vmscan: stall page reclaim and writeback pages based on dirty/writepage pages encountered
Further testing of the "Reduce system disruption due to kswapd"
discovered a few problems.  First and foremost, it's possible for pages
under writeback to be freed which will lead to badness.  Second, as
pages were not being swapped the file LRU was being scanned faster and
clean file pages were being reclaimed.  In some cases this results in
increased read IO to re-read data from disk.  Third, more pages were
being written from kswapd context which can adversly affect IO
performance.  Lastly, it was observed that PageDirty pages are not
necessarily dirty on all filesystems (buffers can be clean while
PageDirty is set and ->writepage generates no IO) and not all
filesystems set PageWriteback when the page is being written (e.g.
ext3).  This disconnect confuses the reclaim stalling logic.  This
follow-up series is aimed at these problems.
The tests were based on three kernels
vanilla:	kernel 3.9 as that is what the current mmotm uses as a baseline
mmotm-20130522	is mmotm as of 22nd May with "Reduce system disruption due to
		kswapd" applied on top as per what should be in Andrew's tree
		right now
lessdisrupt-v7r10 is this follow-up series on top of the mmotm kernel
The first test used memcached+memcachetest while some background IO was
in progress as implemented by the parallel IO tests implement in MM
Tests.  memcachetest benchmarks how many operations/second memcached can
service.  It starts with no background IO on a freshly created ext4
filesystem and then re-runs the test with larger amounts of IO in the
background to roughly simulate a large copy in progress.  The
expectation is that the IO should have little or no impact on
memcachetest which is running entirely in memory.
parallelio
                                             3.9.0                       3.9.0                       3.9.0
                                           vanilla          mm1-mmotm-20130522       mm1-lessdisrupt-v7r10
Ops memcachetest-0M             23117.00 (  0.00%)          22780.00 ( -1.46%)          22763.00 ( -1.53%)
Ops memcachetest-715M           23774.00 (  0.00%)          23299.00 ( -2.00%)          22934.00 ( -3.53%)
Ops memcachetest-2385M           4208.00 (  0.00%)          24154.00 (474.00%)          23765.00 (464.76%)
Ops memcachetest-4055M           4104.00 (  0.00%)          25130.00 (512.33%)          24614.00 (499.76%)
Ops io-duration-0M                  0.00 (  0.00%)              0.00 (  0.00%)              0.00 (  0.00%)
Ops io-duration-715M               12.00 (  0.00%)              7.00 ( 41.67%)              6.00 ( 50.00%)
Ops io-duration-2385M             116.00 (  0.00%)             21.00 ( 81.90%)             21.00 ( 81.90%)
Ops io-duration-4055M             160.00 (  0.00%)             36.00 ( 77.50%)             35.00 ( 78.12%)
Ops swaptotal-0M                    0.00 (  0.00%)              0.00 (  0.00%)              0.00 (  0.00%)
Ops swaptotal-715M             140138.00 (  0.00%)             18.00 ( 99.99%)             18.00 ( 99.99%)
Ops swaptotal-2385M            385682.00 (  0.00%)              0.00 (  0.00%)              0.00 (  0.00%)
Ops swaptotal-4055M            418029.00 (  0.00%)              0.00 (  0.00%)              0.00 (  0.00%)
Ops swapin-0M                       0.00 (  0.00%)              0.00 (  0.00%)              0.00 (  0.00%)
Ops swapin-715M                   144.00 (  0.00%)              0.00 (  0.00%)              0.00 (  0.00%)
Ops swapin-2385M               134227.00 (  0.00%)              0.00 (  0.00%)              0.00 (  0.00%)
Ops swapin-4055M               125618.00 (  0.00%)              0.00 (  0.00%)              0.00 (  0.00%)
Ops minorfaults-0M            1536429.00 (  0.00%)        1531632.00 (  0.31%)        1533541.00 (  0.19%)
Ops minorfaults-715M          1786996.00 (  0.00%)        1612148.00 (  9.78%)        1608832.00 (  9.97%)
Ops minorfaults-2385M         1757952.00 (  0.00%)        1614874.00 (  8.14%)        1613541.00 (  8.21%)
Ops minorfaults-4055M         1774460.00 (  0.00%)        1633400.00 (  7.95%)        1630881.00 (  8.09%)
Ops majorfaults-0M                  1.00 (  0.00%)              0.00 (  0.00%)              0.00 (  0.00%)
Ops majorfaults-715M              184.00 (  0.00%)            167.00 (  9.24%)            166.00 (  9.78%)
Ops majorfaults-2385M           24444.00 (  0.00%)            155.00 ( 99.37%)             93.00 ( 99.62%)
Ops majorfaults-4055M           21357.00 (  0.00%)            147.00 ( 99.31%)            134.00 ( 99.37%)
memcachetest is the transactions/second reported by memcachetest. In
        the vanilla kernel note that performance drops from around
        23K/sec to just over 4K/second when there is 2385M of IO going
        on in the background. With current mmotm, there is no collapse
	in performance and with this follow-up series there is little
	change.
swaptotal is the total amount of swap traffic. With mmotm and the follow-up
	series, the total amount of swapping is much reduced.
                                 3.9.0       3.9.0       3.9.0
                               vanillamm1-mmotm-20130522mm1-lessdisrupt-v7r10
Minor Faults                  11160152    10706748    10622316
Major Faults                     46305         755         678
Swap Ins                        260249           0           0
Swap Outs                       683860          18          18
Direct pages scanned                 0         678        2520
Kswapd pages scanned           6046108     8814900     1639279
Kswapd pages reclaimed         1081954     1172267     1094635
Direct pages reclaimed               0         566        2304
Kswapd efficiency                  17%         13%         66%
Kswapd velocity               5217.560    7618.953    1414.879
Direct efficiency                 100%         83%         91%
Direct velocity                  0.000       0.586       2.175
Percentage direct scans             0%          0%          0%
Zone normal velocity          5105.086    6824.681     671.158
Zone dma32 velocity            112.473     794.858     745.896
Zone dma velocity                0.000       0.000       0.000
Page writes by reclaim     1929612.000 6861768.000   32821.000
Page writes file               1245752     6861750       32803
Page writes anon                683860          18          18
Page reclaim immediate            7484          40         239
Sector Reads                   1130320       93996       86900
Sector Writes                 13508052    10823500    11804436
Page rescued immediate               0           0           0
Slabs scanned                    33536       27136       18560
Direct inode steals                  0           0           0
Kswapd inode steals               8641        1035           0
Kswapd skipped wait                  0           0           0
THP fault alloc                      8          37          33
THP collapse alloc                 508         552         515
THP splits                          24           1           1
THP fault fallback                   0           0           0
THP collapse fail                    0           0           0
There are a number of observations to make here
1. Swap outs are almost eliminated. Swap ins are 0 indicating that the
   pages swapped were really unused anonymous pages. Related to that,
   major faults are much reduced.
2. kswapd efficiency was impacted by the initial series but with these
   follow-up patches, the efficiency is now at 66% indicating that far
   fewer pages were skipped during scanning due to dirty or writeback
   pages.
3. kswapd velocity is reduced indicating that fewer pages are being scanned
   with the follow-up series as kswapd now stalls when the tail of the
   LRU queue is full of unqueued dirty pages. The stall gives flushers a
   chance to catch-up so kswapd can reclaim clean pages when it wakes
4. In light of Zlatko's recent reports about zone scanning imbalances,
   mmtests now reports scanning velocity on a per-zone basis. With mainline,
   you can see that the scanning activity is dominated by the Normal
   zone with over 45 times more scanning in Normal than the DMA32 zone.
   With the series currently in mmotm, the ratio is slightly better but it
   is still the case that the bulk of scanning is in the highest zone. With
   this follow-up series, the ratio of scanning between the Normal and
   DMA32 zone is roughly equal.
5. As Dave Chinner observed, the current patches in mmotm increased the
   number of pages written from kswapd context which is expected to adversly
   impact IO performance. With the follow-up patches, far fewer pages are
   written from kswapd context than the mainline kernel
6. With the series in mmotm, fewer inodes were reclaimed by kswapd. With
   the follow-up series, there is less slab shrinking activity and no inodes
   were reclaimed.
7. Note that "Sectors Read" is drastically reduced implying that the source
   data being used for the IO is not being aggressively discarded due to
   page reclaim skipping over dirty pages and reclaiming clean pages. Note
   that the reducion in reads could also be due to inode data not being
   re-read from disk after a slab shrink.
                       3.9.0       3.9.0       3.9.0
                     vanillamm1-mmotm-20130522mm1-lessdisrupt-v7r10
Mean sda-avgqz        166.99       32.09       33.44
Mean sda-await        853.64      192.76      185.43
Mean sda-r_await        6.31        9.24        5.97
Mean sda-w_await     2992.81      202.65      192.43
Max  sda-avgqz       1409.91      718.75      698.98
Max  sda-await       6665.74     3538.00     3124.23
Max  sda-r_await       58.96      111.95       58.00
Max  sda-w_await    28458.94     3977.29     3148.61
In light of the changes in writes from reclaim context, the number of
reads and Dave Chinner's concerns about IO performance I took a closer
look at the IO stats for the test disk. Few observations
1. The average queue size is reduced by the initial series and roughly
   the same with this follow up.
2. Average wait times for writes are reduced and as the IO
   is completing faster it at least implies that the gain is because
   flushers are writing the files efficiently instead of page reclaim
   getting in the way.
3. The reduction in maximum write latency is staggering. 28 seconds down
   to 3 seconds.
Jan Kara asked how NFS is affected by all of this. Unstable pages can
be taken into account as one of the patches in the series shows but it
is still the case that filesystems with unusual handling of dirty or
writeback could still be treated better.
Tests like postmark, fsmark and largedd showed up nothing useful. On my test
setup, pages are simply not being written back from reclaim context with or
without the patches and there are no changes in performance. My test setup
probably is just not strong enough network-wise to be really interesting.
I ran a longer-lived memcached test with IO going to NFS instead of a local disk
parallelio
                                             3.9.0                       3.9.0                       3.9.0
                                           vanilla          mm1-mmotm-20130522       mm1-lessdisrupt-v7r10
Ops memcachetest-0M             23323.00 (  0.00%)          23241.00 ( -0.35%)          23321.00 ( -0.01%)
Ops memcachetest-715M           25526.00 (  0.00%)          24763.00 ( -2.99%)          23242.00 ( -8.95%)
Ops memcachetest-2385M           8814.00 (  0.00%)          26924.00 (205.47%)          23521.00 (166.86%)
Ops memcachetest-4055M           5835.00 (  0.00%)          26827.00 (359.76%)          25560.00 (338.05%)
Ops io-duration-0M                  0.00 (  0.00%)              0.00 (  0.00%)              0.00 (  0.00%)
Ops io-duration-715M               65.00 (  0.00%)             71.00 ( -9.23%)             11.00 ( 83.08%)
Ops io-duration-2385M             129.00 (  0.00%)             94.00 ( 27.13%)             53.00 ( 58.91%)
Ops io-duration-4055M             301.00 (  0.00%)            100.00 ( 66.78%)            108.00 ( 64.12%)
Ops swaptotal-0M                    0.00 (  0.00%)              0.00 (  0.00%)              0.00 (  0.00%)
Ops swaptotal-715M              14394.00 (  0.00%)            949.00 ( 93.41%)             63.00 ( 99.56%)
Ops swaptotal-2385M            401483.00 (  0.00%)          24437.00 ( 93.91%)          30118.00 ( 92.50%)
Ops swaptotal-4055M            554123.00 (  0.00%)          35688.00 ( 93.56%)          63082.00 ( 88.62%)
Ops swapin-0M                       0.00 (  0.00%)              0.00 (  0.00%)              0.00 (  0.00%)
Ops swapin-715M                  4522.00 (  0.00%)            560.00 ( 87.62%)             63.00 ( 98.61%)
Ops swapin-2385M               169861.00 (  0.00%)           5026.00 ( 97.04%)          13917.00 ( 91.81%)
Ops swapin-4055M               192374.00 (  0.00%)          10056.00 ( 94.77%)          25729.00 ( 86.63%)
Ops minorfaults-0M            1445969.00 (  0.00%)        1520878.00 ( -5.18%)        1454024.00 ( -0.56%)
Ops minorfaults-715M          1557288.00 (  0.00%)        1528482.00 (  1.85%)        1535776.00 (  1.38%)
Ops minorfaults-2385M         1692896.00 (  0.00%)        1570523.00 (  7.23%)        1559622.00 (  7.87%)
Ops minorfaults-4055M         1654985.00 (  0.00%)        1581456.00 (  4.44%)        1596713.00 (  3.52%)
Ops majorfaults-0M                  0.00 (  0.00%)              1.00 (-99.00%)              0.00 (  0.00%)
Ops majorfaults-715M              763.00 (  0.00%)            265.00 ( 65.27%)             75.00 ( 90.17%)
Ops majorfaults-2385M           23861.00 (  0.00%)            894.00 ( 96.25%)           2189.00 ( 90.83%)
Ops majorfaults-4055M           27210.00 (  0.00%)           1569.00 ( 94.23%)           4088.00 ( 84.98%)
1. Performance does not collapse due to IO which is good. IO is also completing
   faster. Note with mmotm, IO completes in a third of the time and faster again
   with this series applied
2. Swapping is reduced, although not eliminated. The figures for the follow-up
   look bad but it does vary a bit as the stalling is not perfect for nfs
   or filesystems like ext3 with unusual handling of dirty and writeback
   pages
3. There are swapins, particularly with larger amounts of IO indicating
   that active pages are being reclaimed. However, the number of much
   reduced.
                                 3.9.0       3.9.0       3.9.0
                               vanillamm1-mmotm-20130522mm1-lessdisrupt-v7r10
Minor Faults                  36339175    35025445    35219699
Major Faults                    310964       27108       51887
Swap Ins                       2176399      173069      333316
Swap Outs                      3344050      357228      504824
Direct pages scanned              8972       77283       43242
Kswapd pages scanned          20899983     8939566    14772851
Kswapd pages reclaimed         6193156     5172605     5231026
Direct pages reclaimed            8450       73802       39514
Kswapd efficiency                  29%         57%         35%
Kswapd velocity               3929.743    1847.499    3058.840
Direct efficiency                  94%         95%         91%
Direct velocity                  1.687      15.972       8.954
Percentage direct scans             0%          0%          0%
Zone normal velocity          3721.907     939.103    2185.142
Zone dma32 velocity            209.522     924.368     882.651
Zone dma velocity                0.000       0.000       0.000
Page writes by reclaim     4082185.000  526319.000  537114.000
Page writes file                738135      169091       32290
Page writes anon               3344050      357228      504824
Page reclaim immediate            9524         170     5595843
Sector Reads                   8909900      861192     1483680
Sector Writes                 13428980     1488744     2076800
Page rescued immediate               0           0           0
Slabs scanned                    38016       31744       28672
Direct inode steals                  0           0           0
Kswapd inode steals                424           0           0
Kswapd skipped wait                  0           0           0
THP fault alloc                     14          15         119
THP collapse alloc                1767        1569        1618
THP splits                          30          29          25
THP fault fallback                   0           0           0
THP collapse fail                    8           5           0
Compaction stalls                   17          41         100
Compaction success                   7          31          95
Compaction failures                 10          10           5
Page migrate success              7083       22157       62217
Page migrate failure                 0           0           0
Compaction pages isolated        14847       48758      135830
Compaction migrate scanned       18328       48398      138929
Compaction free scanned        2000255      355827     1720269
Compaction cost                      7          24          68
I guess the main takeaway again is the much reduced page writes
from reclaim context and reduced reads.
                       3.9.0       3.9.0       3.9.0
                     vanillamm1-mmotm-20130522mm1-lessdisrupt-v7r10
Mean sda-avgqz         23.58        0.35        0.44
Mean sda-await        133.47       15.72       15.46
Mean sda-r_await        4.72        4.69        3.95
Mean sda-w_await      507.69       28.40       33.68
Max  sda-avgqz        680.60       12.25       23.14
Max  sda-await       3958.89      221.83      286.22
Max  sda-r_await       63.86       61.23       67.29
Max  sda-w_await    11710.38      883.57     1767.28
And as before, write wait times are much reduced.
This patch:
The patch "mm: vmscan: Have kswapd writeback pages based on dirty pages
encountered, not priority" decides whether to writeback pages from reclaim
context based on the number of dirty pages encountered.  This situation is
flagged too easily and flushers are not given the chance to catch up
resulting in more pages being written from reclaim context and potentially
impacting IO performance.  The check for PageWriteback is also misplaced
as it happens within a PageDirty check which is nonsense as the dirty may
have been cleared for IO.  The accounting is updated very late and pages
that are already under writeback, were reactivated, could not unmapped or
could not be released are all missed.  Similarly, a page is considered
congested for reasons other than being congested and pages that cannot be
written out in the correct context are skipped.  Finally, it considers
stalling and writing back filesystem pages due to encountering dirty
anonymous pages at the tail of the LRU which is dumb.
This patch causes kswapd to begin writing filesystem pages from reclaim
context only if page reclaim found that all filesystem pages at the tail
of the LRU were unqueued dirty pages.  Before it starts writing filesystem
pages, it will stall to give flushers a chance to catch up.  The decision
on whether wait_iff_congested is also now determined by dirty filesystem
pages only.  Congested pages are based on whether the underlying BDI is
congested regardless of the context of the reclaiming process.
Signed-off-by: Mel Gorman <mgorman@suse.de>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Michal Hocko <mhocko@suse.cz>
Cc: Rik van Riel <riel@redhat.com>
Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Jiri Slaby <jslaby@suse.cz>
Cc: Valdis Kletnieks <Valdis.Kletnieks@vt.edu>
Cc: Zlatko Calusic <zcalusic@bitsync.net>
Cc: dormando <dormando@rydia.net>
Cc: Trond Myklebust <trond.myklebust@fys.uio.no>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
											
										 
											2013-07-03 15:01:57 -07:00
										 |  |  | 			/* Adding to swap updated mapping */ | 
					
						
							|  |  |  | 			mapping = page_mapping(page); | 
					
						
							|  |  |  | 		} | 
					
						
							| 
									
										
										
										
											2005-04-16 15:20:36 -07:00
										 |  |  | 
 | 
					
						
							|  |  |  | 		/*
 | 
					
						
							|  |  |  | 		 * The page is mapped into the page tables of one or more | 
					
						
							|  |  |  | 		 * processes. Try to unmap it here. | 
					
						
							|  |  |  | 		 */ | 
					
						
							|  |  |  | 		if (page_mapped(page) && mapping) { | 
					
						
							| 
									
										
										
										
											2012-10-08 16:31:55 -07:00
										 |  |  | 			switch (try_to_unmap(page, ttu_flags)) { | 
					
						
							| 
									
										
										
										
											2005-04-16 15:20:36 -07:00
										 |  |  | 			case SWAP_FAIL: | 
					
						
							|  |  |  | 				goto activate_locked; | 
					
						
							|  |  |  | 			case SWAP_AGAIN: | 
					
						
							|  |  |  | 				goto keep_locked; | 
					
						
							| 
									
										
											  
											
												mlock: mlocked pages are unevictable
Make sure that mlocked pages also live on the unevictable LRU, so kswapd
will not scan them over and over again.
This is achieved through various strategies:
1) add yet another page flag--PG_mlocked--to indicate that
   the page is locked for efficient testing in vmscan and,
   optionally, fault path.  This allows early culling of
   unevictable pages, preventing them from getting to
   page_referenced()/try_to_unmap().  Also allows separate
   accounting of mlock'd pages, as Nick's original patch
   did.
   Note:  Nick's original mlock patch used a PG_mlocked
   flag.  I had removed this in favor of the PG_unevictable
   flag + an mlock_count [new page struct member].  I
   restored the PG_mlocked flag to eliminate the new
   count field.
2) add the mlock/unevictable infrastructure to mm/mlock.c,
   with internal APIs in mm/internal.h.  This is a rework
   of Nick's original patch to these files, taking into
   account that mlocked pages are now kept on unevictable
   LRU list.
3) update vmscan.c:page_evictable() to check PageMlocked()
   and, if vma passed in, the vm_flags.  Note that the vma
   will only be passed in for new pages in the fault path;
   and then only if the "cull unevictable pages in fault
   path" patch is included.
4) add try_to_unlock() to rmap.c to walk a page's rmap and
   ClearPageMlocked() if no other vmas have it mlocked.
   Reuses as much of try_to_unmap() as possible.  This
   effectively replaces the use of one of the lru list links
   as an mlock count.  If this mechanism let's pages in mlocked
   vmas leak through w/o PG_mlocked set [I don't know that it
   does], we should catch them later in try_to_unmap().  One
   hopes this will be rare, as it will be relatively expensive.
Original mm/internal.h, mm/rmap.c and mm/mlock.c changes:
Signed-off-by: Nick Piggin <npiggin@suse.de>
splitlru: introduce __get_user_pages():
  New munlock processing need to GUP_FLAGS_IGNORE_VMA_PERMISSIONS.
  because current get_user_pages() can't grab PROT_NONE pages theresore it
  cause PROT_NONE pages can't munlock.
[akpm@linux-foundation.org: fix this for pagemap-pass-mm-into-pagewalkers.patch]
[akpm@linux-foundation.org: untangle patch interdependencies]
[akpm@linux-foundation.org: fix things after out-of-order merging]
[hugh@veritas.com: fix page-flags mess]
[lee.schermerhorn@hp.com: fix munlock page table walk - now requires 'mm']
[kosaki.motohiro@jp.fujitsu.com: build fix]
[kosaki.motohiro@jp.fujitsu.com: fix truncate race and sevaral comments]
[kosaki.motohiro@jp.fujitsu.com: splitlru: introduce __get_user_pages()]
Signed-off-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Signed-off-by: Rik van Riel <riel@redhat.com>
Signed-off-by: Lee Schermerhorn <lee.schermerhorn@hp.com>
Cc: Nick Piggin <npiggin@suse.de>
Cc: Dave Hansen <dave@linux.vnet.ibm.com>
Cc: Matt Mackall <mpm@selenic.com>
Signed-off-by: Hugh Dickins <hugh@veritas.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
											
										 
											2008-10-18 20:26:44 -07:00
										 |  |  | 			case SWAP_MLOCK: | 
					
						
							|  |  |  | 				goto cull_mlocked; | 
					
						
							| 
									
										
										
										
											2005-04-16 15:20:36 -07:00
										 |  |  | 			case SWAP_SUCCESS: | 
					
						
							|  |  |  | 				; /* try to free the page below */ | 
					
						
							|  |  |  | 			} | 
					
						
							|  |  |  | 		} | 
					
						
							|  |  |  | 
 | 
					
						
							|  |  |  | 		if (PageDirty(page)) { | 
					
						
							| 
									
										
											  
											
												mm: vmscan: do not writeback filesystem pages in direct reclaim
Testing from the XFS folk revealed that there is still too much I/O from
the end of the LRU in kswapd.  Previously it was considered acceptable by
VM people for a small number of pages to be written back from reclaim with
testing generally showing about 0.3% of pages reclaimed were written back
(higher if memory was low).  That writing back a small number of pages is
ok has been heavily disputed for quite some time and Dave Chinner
explained it well;
	It doesn't have to be a very high number to be a problem. IO
	is orders of magnitude slower than the CPU time it takes to
	flush a page, so the cost of making a bad flush decision is
	very high. And single page writeback from the LRU is almost
	always a bad flush decision.
To complicate matters, filesystems respond very differently to requests
from reclaim according to Christoph Hellwig;
	xfs tries to write it back if the requester is kswapd
	ext4 ignores the request if it's a delayed allocation
	btrfs ignores the request
As a result, each filesystem has different performance characteristics
when under memory pressure and there are many pages being dirtied.  In
some cases, the request is ignored entirely so the VM cannot depend on the
IO being dispatched.
The objective of this series is to reduce writing of filesystem-backed
pages from reclaim, play nicely with writeback that is already in progress
and throttle reclaim appropriately when writeback pages are encountered.
The assumption is that the flushers will always write pages faster than if
reclaim issues the IO.
A secondary goal is to avoid the problem whereby direct reclaim splices
two potentially deep call stacks together.
There is a potential new problem as reclaim has less control over how long
before a page in a particularly zone or container is cleaned and direct
reclaimers depend on kswapd or flusher threads to do the necessary work.
However, as filesystems sometimes ignore direct reclaim requests already,
it is not expected to be a serious issue.
Patch 1 disables writeback of filesystem pages from direct reclaim
	entirely. Anonymous pages are still written.
Patch 2 removes dead code in lumpy reclaim as it is no longer able
	to synchronously write pages. This hurts lumpy reclaim but
	there is an expectation that compaction is used for hugepage
	allocations these days and lumpy reclaim's days are numbered.
Patches 3-4 add warnings to XFS and ext4 if called from
	direct reclaim. With patch 1, this "never happens" and is
	intended to catch regressions in this logic in the future.
Patch 5 disables writeback of filesystem pages from kswapd unless
	the priority is raised to the point where kswapd is considered
	to be in trouble.
Patch 6 throttles reclaimers if too many dirty pages are being
	encountered and the zones or backing devices are congested.
Patch 7 invalidates dirty pages found at the end of the LRU so they
	are reclaimed quickly after being written back rather than
	waiting for a reclaimer to find them
I consider this series to be orthogonal to the writeback work but it is
worth noting that the writeback work affects the viability of patch 8 in
particular.
I tested this on ext4 and xfs using fs_mark, a simple writeback test based
on dd and a micro benchmark that does a streaming write to a large mapping
(exercises use-once LRU logic) followed by streaming writes to a mix of
anonymous and file-backed mappings.  The command line for fs_mark when
botted with 512M looked something like
./fs_mark -d  /tmp/fsmark-2676  -D  100  -N  150  -n  150  -L  25  -t  1  -S0  -s  10485760
The number of files was adjusted depending on the amount of available
memory so that the files created was about 3xRAM.  For multiple threads,
the -d switch is specified multiple times.
The test machine is x86-64 with an older generation of AMD processor with
4 cores.  The underlying storage was 4 disks configured as RAID-0 as this
was the best configuration of storage I had available.  Swap is on a
separate disk.  Dirty ratio was tuned to 40% instead of the default of
20%.
Testing was run with and without monitors to both verify that the patches
were operating as expected and that any performance gain was real and not
due to interference from monitors.
Here is a summary of results based on testing XFS.
512M1P-xfs           Files/s  mean                 32.69 ( 0.00%)     34.44 ( 5.08%)
512M1P-xfs           Elapsed Time fsmark                    51.41     48.29
512M1P-xfs           Elapsed Time simple-wb                114.09    108.61
512M1P-xfs           Elapsed Time mmap-strm                113.46    109.34
512M1P-xfs           Kswapd efficiency fsmark                 62%       63%
512M1P-xfs           Kswapd efficiency simple-wb              56%       61%
512M1P-xfs           Kswapd efficiency mmap-strm              44%       42%
512M-xfs             Files/s  mean                 30.78 ( 0.00%)     35.94 (14.36%)
512M-xfs             Elapsed Time fsmark                    56.08     48.90
512M-xfs             Elapsed Time simple-wb                112.22     98.13
512M-xfs             Elapsed Time mmap-strm                219.15    196.67
512M-xfs             Kswapd efficiency fsmark                 54%       56%
512M-xfs             Kswapd efficiency simple-wb              54%       55%
512M-xfs             Kswapd efficiency mmap-strm              45%       44%
512M-4X-xfs          Files/s  mean                 30.31 ( 0.00%)     33.33 ( 9.06%)
512M-4X-xfs          Elapsed Time fsmark                    63.26     55.88
512M-4X-xfs          Elapsed Time simple-wb                100.90     90.25
512M-4X-xfs          Elapsed Time mmap-strm                261.73    255.38
512M-4X-xfs          Kswapd efficiency fsmark                 49%       50%
512M-4X-xfs          Kswapd efficiency simple-wb              54%       56%
512M-4X-xfs          Kswapd efficiency mmap-strm              37%       36%
512M-16X-xfs         Files/s  mean                 60.89 ( 0.00%)     65.22 ( 6.64%)
512M-16X-xfs         Elapsed Time fsmark                    67.47     58.25
512M-16X-xfs         Elapsed Time simple-wb                103.22     90.89
512M-16X-xfs         Elapsed Time mmap-strm                237.09    198.82
512M-16X-xfs         Kswapd efficiency fsmark                 45%       46%
512M-16X-xfs         Kswapd efficiency simple-wb              53%       55%
512M-16X-xfs         Kswapd efficiency mmap-strm              33%       33%
Up until 512-4X, the FSmark improvements were statistically significant.
For the 4X and 16X tests the results were within standard deviations but
just barely.  The time to completion for all tests is improved which is an
important result.  In general, kswapd efficiency is not affected by
skipping dirty pages.
1024M1P-xfs          Files/s  mean                 39.09 ( 0.00%)     41.15 ( 5.01%)
1024M1P-xfs          Elapsed Time fsmark                    84.14     80.41
1024M1P-xfs          Elapsed Time simple-wb                210.77    184.78
1024M1P-xfs          Elapsed Time mmap-strm                162.00    160.34
1024M1P-xfs          Kswapd efficiency fsmark                 69%       75%
1024M1P-xfs          Kswapd efficiency simple-wb              71%       77%
1024M1P-xfs          Kswapd efficiency mmap-strm              43%       44%
1024M-xfs            Files/s  mean                 35.45 ( 0.00%)     37.00 ( 4.19%)
1024M-xfs            Elapsed Time fsmark                    94.59     91.00
1024M-xfs            Elapsed Time simple-wb                229.84    195.08
1024M-xfs            Elapsed Time mmap-strm                405.38    440.29
1024M-xfs            Kswapd efficiency fsmark                 79%       71%
1024M-xfs            Kswapd efficiency simple-wb              74%       74%
1024M-xfs            Kswapd efficiency mmap-strm              39%       42%
1024M-4X-xfs         Files/s  mean                 32.63 ( 0.00%)     35.05 ( 6.90%)
1024M-4X-xfs         Elapsed Time fsmark                   103.33     97.74
1024M-4X-xfs         Elapsed Time simple-wb                204.48    178.57
1024M-4X-xfs         Elapsed Time mmap-strm                528.38    511.88
1024M-4X-xfs         Kswapd efficiency fsmark                 81%       70%
1024M-4X-xfs         Kswapd efficiency simple-wb              73%       72%
1024M-4X-xfs         Kswapd efficiency mmap-strm              39%       38%
1024M-16X-xfs        Files/s  mean                 42.65 ( 0.00%)     42.97 ( 0.74%)
1024M-16X-xfs        Elapsed Time fsmark                   103.11     99.11
1024M-16X-xfs        Elapsed Time simple-wb                200.83    178.24
1024M-16X-xfs        Elapsed Time mmap-strm                397.35    459.82
1024M-16X-xfs        Kswapd efficiency fsmark                 84%       69%
1024M-16X-xfs        Kswapd efficiency simple-wb              74%       73%
1024M-16X-xfs        Kswapd efficiency mmap-strm              39%       40%
All FSMark tests up to 16X had statistically significant improvements.
For the most part, tests are completing faster with the exception of the
streaming writes to a mixture of anonymous and file-backed mappings which
were slower in two cases
In the cases where the mmap-strm tests were slower, there was more
swapping due to dirty pages being skipped.  The number of additional pages
swapped is almost identical to the fewer number of pages written from
reclaim.  In other words, roughly the same number of pages were reclaimed
but swapping was slower.  As the test is a bit unrealistic and stresses
memory heavily, the small shift is acceptable.
4608M1P-xfs          Files/s  mean                 29.75 ( 0.00%)     30.96 ( 3.91%)
4608M1P-xfs          Elapsed Time fsmark                   512.01    492.15
4608M1P-xfs          Elapsed Time simple-wb                618.18    566.24
4608M1P-xfs          Elapsed Time mmap-strm                488.05    465.07
4608M1P-xfs          Kswapd efficiency fsmark                 93%       86%
4608M1P-xfs          Kswapd efficiency simple-wb              88%       84%
4608M1P-xfs          Kswapd efficiency mmap-strm              46%       45%
4608M-xfs            Files/s  mean                 27.60 ( 0.00%)     28.85 ( 4.33%)
4608M-xfs            Elapsed Time fsmark                   555.96    532.34
4608M-xfs            Elapsed Time simple-wb                659.72    571.85
4608M-xfs            Elapsed Time mmap-strm               1082.57   1146.38
4608M-xfs            Kswapd efficiency fsmark                 89%       91%
4608M-xfs            Kswapd efficiency simple-wb              88%       82%
4608M-xfs            Kswapd efficiency mmap-strm              48%       46%
4608M-4X-xfs         Files/s  mean                 26.00 ( 0.00%)     27.47 ( 5.35%)
4608M-4X-xfs         Elapsed Time fsmark                   592.91    564.00
4608M-4X-xfs         Elapsed Time simple-wb                616.65    575.07
4608M-4X-xfs         Elapsed Time mmap-strm               1773.02   1631.53
4608M-4X-xfs         Kswapd efficiency fsmark                 90%       94%
4608M-4X-xfs         Kswapd efficiency simple-wb              87%       82%
4608M-4X-xfs         Kswapd efficiency mmap-strm              43%       43%
4608M-16X-xfs        Files/s  mean                 26.07 ( 0.00%)     26.42 ( 1.32%)
4608M-16X-xfs        Elapsed Time fsmark                   602.69    585.78
4608M-16X-xfs        Elapsed Time simple-wb                606.60    573.81
4608M-16X-xfs        Elapsed Time mmap-strm               1549.75   1441.86
4608M-16X-xfs        Kswapd efficiency fsmark                 98%       98%
4608M-16X-xfs        Kswapd efficiency simple-wb              88%       82%
4608M-16X-xfs        Kswapd efficiency mmap-strm              44%       42%
Unlike the other tests, the fsmark results are not statistically
significant but the min and max times are both improved and for the most
part, tests completed faster.
There are other indications that this is an improvement as well.  For
example, in the vast majority of cases, there were fewer pages scanned by
direct reclaim implying in many cases that stalls due to direct reclaim
are reduced.  KSwapd is scanning more due to skipping dirty pages which is
unfortunate but the CPU usage is still acceptable
In an earlier set of tests, I used blktrace and in almost all cases
throughput throughout the entire test was higher.  However, I ended up
discarding those results as recording blktrace data was too heavy for my
liking.
On a laptop, I plugged in a USB stick and ran a similar tests of tests
using it as backing storage.  A desktop environment was running and for
the entire duration of the tests, firefox and gnome terminal were
launching and exiting to vaguely simulate a user.
1024M-xfs            Files/s  mean               0.41 ( 0.00%)        0.44 ( 6.82%)
1024M-xfs            Elapsed Time fsmark               2053.52   1641.03
1024M-xfs            Elapsed Time simple-wb            1229.53    768.05
1024M-xfs            Elapsed Time mmap-strm            4126.44   4597.03
1024M-xfs            Kswapd efficiency fsmark              84%       85%
1024M-xfs            Kswapd efficiency simple-wb           92%       81%
1024M-xfs            Kswapd efficiency mmap-strm           60%       51%
1024M-xfs            Avg wait ms fsmark                5404.53     4473.87
1024M-xfs            Avg wait ms simple-wb             2541.35     1453.54
1024M-xfs            Avg wait ms mmap-strm             3400.25     3852.53
The mmap-strm results were hurt because firefox launching had a tendency
to push the test out of memory.  On the postive side, firefox launched
marginally faster with the patches applied.  Time to completion for many
tests was faster but more importantly - the "Avg wait" time as measured by
iostat was far lower implying the system would be more responsive.  It was
also the case that "Avg wait ms" on the root filesystem was lower.  I
tested it manually and while the system felt slightly more responsive
while copying data to a USB stick, it was marginal enough that it could be
my imagination.
This patch: do not writeback filesystem pages in direct reclaim.
When kswapd is failing to keep zones above the min watermark, a process
will enter direct reclaim in the same manner kswapd does.  If a dirty page
is encountered during the scan, this page is written to backing storage
using mapping->writepage.
This causes two problems.  First, it can result in very deep call stacks,
particularly if the target storage or filesystem are complex.  Some
filesystems ignore write requests from direct reclaim as a result.  The
second is that a single-page flush is inefficient in terms of IO.  While
there is an expectation that the elevator will merge requests, this does
not always happen.  Quoting Christoph Hellwig;
	The elevator has a relatively small window it can operate on,
	and can never fix up a bad large scale writeback pattern.
This patch prevents direct reclaim writing back filesystem pages by
checking if current is kswapd.  Anonymous pages are still written to swap
as there is not the equivalent of a flusher thread for anonymous pages.
If the dirty pages cannot be written back, they are placed back on the LRU
lists.  There is now a direct dependency on dirty page balancing to
prevent too many pages in the system being dirtied which would prevent
reclaim making forward progress.
Signed-off-by: Mel Gorman <mgorman@suse.de>
Reviewed-by: Minchan Kim <minchan.kim@gmail.com>
Cc: Dave Chinner <david@fromorbit.com>
Cc: Christoph Hellwig <hch@infradead.org>
Cc: Johannes Weiner <jweiner@redhat.com>
Cc: Wu Fengguang <fengguang.wu@intel.com>
Cc: Jan Kara <jack@suse.cz>
Cc: Rik van Riel <riel@redhat.com>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Alex Elder <aelder@sgi.com>
Cc: Theodore Ts'o <tytso@mit.edu>
Cc: Chris Mason <chris.mason@oracle.com>
Cc: Dave Hansen <dave@linux.vnet.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
											
										 
											2011-10-31 17:07:38 -07:00
										 |  |  | 			/*
 | 
					
						
							|  |  |  | 			 * Only kswapd can writeback filesystem pages to | 
					
						
							| 
									
										
										
										
											2013-07-03 15:01:50 -07:00
										 |  |  | 			 * avoid risk of stack overflow but only writeback | 
					
						
							|  |  |  | 			 * if many dirty pages have been encountered. | 
					
						
							| 
									
										
											  
											
												mm: vmscan: do not writeback filesystem pages in direct reclaim
Testing from the XFS folk revealed that there is still too much I/O from
the end of the LRU in kswapd.  Previously it was considered acceptable by
VM people for a small number of pages to be written back from reclaim with
testing generally showing about 0.3% of pages reclaimed were written back
(higher if memory was low).  That writing back a small number of pages is
ok has been heavily disputed for quite some time and Dave Chinner
explained it well;
	It doesn't have to be a very high number to be a problem. IO
	is orders of magnitude slower than the CPU time it takes to
	flush a page, so the cost of making a bad flush decision is
	very high. And single page writeback from the LRU is almost
	always a bad flush decision.
To complicate matters, filesystems respond very differently to requests
from reclaim according to Christoph Hellwig;
	xfs tries to write it back if the requester is kswapd
	ext4 ignores the request if it's a delayed allocation
	btrfs ignores the request
As a result, each filesystem has different performance characteristics
when under memory pressure and there are many pages being dirtied.  In
some cases, the request is ignored entirely so the VM cannot depend on the
IO being dispatched.
The objective of this series is to reduce writing of filesystem-backed
pages from reclaim, play nicely with writeback that is already in progress
and throttle reclaim appropriately when writeback pages are encountered.
The assumption is that the flushers will always write pages faster than if
reclaim issues the IO.
A secondary goal is to avoid the problem whereby direct reclaim splices
two potentially deep call stacks together.
There is a potential new problem as reclaim has less control over how long
before a page in a particularly zone or container is cleaned and direct
reclaimers depend on kswapd or flusher threads to do the necessary work.
However, as filesystems sometimes ignore direct reclaim requests already,
it is not expected to be a serious issue.
Patch 1 disables writeback of filesystem pages from direct reclaim
	entirely. Anonymous pages are still written.
Patch 2 removes dead code in lumpy reclaim as it is no longer able
	to synchronously write pages. This hurts lumpy reclaim but
	there is an expectation that compaction is used for hugepage
	allocations these days and lumpy reclaim's days are numbered.
Patches 3-4 add warnings to XFS and ext4 if called from
	direct reclaim. With patch 1, this "never happens" and is
	intended to catch regressions in this logic in the future.
Patch 5 disables writeback of filesystem pages from kswapd unless
	the priority is raised to the point where kswapd is considered
	to be in trouble.
Patch 6 throttles reclaimers if too many dirty pages are being
	encountered and the zones or backing devices are congested.
Patch 7 invalidates dirty pages found at the end of the LRU so they
	are reclaimed quickly after being written back rather than
	waiting for a reclaimer to find them
I consider this series to be orthogonal to the writeback work but it is
worth noting that the writeback work affects the viability of patch 8 in
particular.
I tested this on ext4 and xfs using fs_mark, a simple writeback test based
on dd and a micro benchmark that does a streaming write to a large mapping
(exercises use-once LRU logic) followed by streaming writes to a mix of
anonymous and file-backed mappings.  The command line for fs_mark when
botted with 512M looked something like
./fs_mark -d  /tmp/fsmark-2676  -D  100  -N  150  -n  150  -L  25  -t  1  -S0  -s  10485760
The number of files was adjusted depending on the amount of available
memory so that the files created was about 3xRAM.  For multiple threads,
the -d switch is specified multiple times.
The test machine is x86-64 with an older generation of AMD processor with
4 cores.  The underlying storage was 4 disks configured as RAID-0 as this
was the best configuration of storage I had available.  Swap is on a
separate disk.  Dirty ratio was tuned to 40% instead of the default of
20%.
Testing was run with and without monitors to both verify that the patches
were operating as expected and that any performance gain was real and not
due to interference from monitors.
Here is a summary of results based on testing XFS.
512M1P-xfs           Files/s  mean                 32.69 ( 0.00%)     34.44 ( 5.08%)
512M1P-xfs           Elapsed Time fsmark                    51.41     48.29
512M1P-xfs           Elapsed Time simple-wb                114.09    108.61
512M1P-xfs           Elapsed Time mmap-strm                113.46    109.34
512M1P-xfs           Kswapd efficiency fsmark                 62%       63%
512M1P-xfs           Kswapd efficiency simple-wb              56%       61%
512M1P-xfs           Kswapd efficiency mmap-strm              44%       42%
512M-xfs             Files/s  mean                 30.78 ( 0.00%)     35.94 (14.36%)
512M-xfs             Elapsed Time fsmark                    56.08     48.90
512M-xfs             Elapsed Time simple-wb                112.22     98.13
512M-xfs             Elapsed Time mmap-strm                219.15    196.67
512M-xfs             Kswapd efficiency fsmark                 54%       56%
512M-xfs             Kswapd efficiency simple-wb              54%       55%
512M-xfs             Kswapd efficiency mmap-strm              45%       44%
512M-4X-xfs          Files/s  mean                 30.31 ( 0.00%)     33.33 ( 9.06%)
512M-4X-xfs          Elapsed Time fsmark                    63.26     55.88
512M-4X-xfs          Elapsed Time simple-wb                100.90     90.25
512M-4X-xfs          Elapsed Time mmap-strm                261.73    255.38
512M-4X-xfs          Kswapd efficiency fsmark                 49%       50%
512M-4X-xfs          Kswapd efficiency simple-wb              54%       56%
512M-4X-xfs          Kswapd efficiency mmap-strm              37%       36%
512M-16X-xfs         Files/s  mean                 60.89 ( 0.00%)     65.22 ( 6.64%)
512M-16X-xfs         Elapsed Time fsmark                    67.47     58.25
512M-16X-xfs         Elapsed Time simple-wb                103.22     90.89
512M-16X-xfs         Elapsed Time mmap-strm                237.09    198.82
512M-16X-xfs         Kswapd efficiency fsmark                 45%       46%
512M-16X-xfs         Kswapd efficiency simple-wb              53%       55%
512M-16X-xfs         Kswapd efficiency mmap-strm              33%       33%
Up until 512-4X, the FSmark improvements were statistically significant.
For the 4X and 16X tests the results were within standard deviations but
just barely.  The time to completion for all tests is improved which is an
important result.  In general, kswapd efficiency is not affected by
skipping dirty pages.
1024M1P-xfs          Files/s  mean                 39.09 ( 0.00%)     41.15 ( 5.01%)
1024M1P-xfs          Elapsed Time fsmark                    84.14     80.41
1024M1P-xfs          Elapsed Time simple-wb                210.77    184.78
1024M1P-xfs          Elapsed Time mmap-strm                162.00    160.34
1024M1P-xfs          Kswapd efficiency fsmark                 69%       75%
1024M1P-xfs          Kswapd efficiency simple-wb              71%       77%
1024M1P-xfs          Kswapd efficiency mmap-strm              43%       44%
1024M-xfs            Files/s  mean                 35.45 ( 0.00%)     37.00 ( 4.19%)
1024M-xfs            Elapsed Time fsmark                    94.59     91.00
1024M-xfs            Elapsed Time simple-wb                229.84    195.08
1024M-xfs            Elapsed Time mmap-strm                405.38    440.29
1024M-xfs            Kswapd efficiency fsmark                 79%       71%
1024M-xfs            Kswapd efficiency simple-wb              74%       74%
1024M-xfs            Kswapd efficiency mmap-strm              39%       42%
1024M-4X-xfs         Files/s  mean                 32.63 ( 0.00%)     35.05 ( 6.90%)
1024M-4X-xfs         Elapsed Time fsmark                   103.33     97.74
1024M-4X-xfs         Elapsed Time simple-wb                204.48    178.57
1024M-4X-xfs         Elapsed Time mmap-strm                528.38    511.88
1024M-4X-xfs         Kswapd efficiency fsmark                 81%       70%
1024M-4X-xfs         Kswapd efficiency simple-wb              73%       72%
1024M-4X-xfs         Kswapd efficiency mmap-strm              39%       38%
1024M-16X-xfs        Files/s  mean                 42.65 ( 0.00%)     42.97 ( 0.74%)
1024M-16X-xfs        Elapsed Time fsmark                   103.11     99.11
1024M-16X-xfs        Elapsed Time simple-wb                200.83    178.24
1024M-16X-xfs        Elapsed Time mmap-strm                397.35    459.82
1024M-16X-xfs        Kswapd efficiency fsmark                 84%       69%
1024M-16X-xfs        Kswapd efficiency simple-wb              74%       73%
1024M-16X-xfs        Kswapd efficiency mmap-strm              39%       40%
All FSMark tests up to 16X had statistically significant improvements.
For the most part, tests are completing faster with the exception of the
streaming writes to a mixture of anonymous and file-backed mappings which
were slower in two cases
In the cases where the mmap-strm tests were slower, there was more
swapping due to dirty pages being skipped.  The number of additional pages
swapped is almost identical to the fewer number of pages written from
reclaim.  In other words, roughly the same number of pages were reclaimed
but swapping was slower.  As the test is a bit unrealistic and stresses
memory heavily, the small shift is acceptable.
4608M1P-xfs          Files/s  mean                 29.75 ( 0.00%)     30.96 ( 3.91%)
4608M1P-xfs          Elapsed Time fsmark                   512.01    492.15
4608M1P-xfs          Elapsed Time simple-wb                618.18    566.24
4608M1P-xfs          Elapsed Time mmap-strm                488.05    465.07
4608M1P-xfs          Kswapd efficiency fsmark                 93%       86%
4608M1P-xfs          Kswapd efficiency simple-wb              88%       84%
4608M1P-xfs          Kswapd efficiency mmap-strm              46%       45%
4608M-xfs            Files/s  mean                 27.60 ( 0.00%)     28.85 ( 4.33%)
4608M-xfs            Elapsed Time fsmark                   555.96    532.34
4608M-xfs            Elapsed Time simple-wb                659.72    571.85
4608M-xfs            Elapsed Time mmap-strm               1082.57   1146.38
4608M-xfs            Kswapd efficiency fsmark                 89%       91%
4608M-xfs            Kswapd efficiency simple-wb              88%       82%
4608M-xfs            Kswapd efficiency mmap-strm              48%       46%
4608M-4X-xfs         Files/s  mean                 26.00 ( 0.00%)     27.47 ( 5.35%)
4608M-4X-xfs         Elapsed Time fsmark                   592.91    564.00
4608M-4X-xfs         Elapsed Time simple-wb                616.65    575.07
4608M-4X-xfs         Elapsed Time mmap-strm               1773.02   1631.53
4608M-4X-xfs         Kswapd efficiency fsmark                 90%       94%
4608M-4X-xfs         Kswapd efficiency simple-wb              87%       82%
4608M-4X-xfs         Kswapd efficiency mmap-strm              43%       43%
4608M-16X-xfs        Files/s  mean                 26.07 ( 0.00%)     26.42 ( 1.32%)
4608M-16X-xfs        Elapsed Time fsmark                   602.69    585.78
4608M-16X-xfs        Elapsed Time simple-wb                606.60    573.81
4608M-16X-xfs        Elapsed Time mmap-strm               1549.75   1441.86
4608M-16X-xfs        Kswapd efficiency fsmark                 98%       98%
4608M-16X-xfs        Kswapd efficiency simple-wb              88%       82%
4608M-16X-xfs        Kswapd efficiency mmap-strm              44%       42%
Unlike the other tests, the fsmark results are not statistically
significant but the min and max times are both improved and for the most
part, tests completed faster.
There are other indications that this is an improvement as well.  For
example, in the vast majority of cases, there were fewer pages scanned by
direct reclaim implying in many cases that stalls due to direct reclaim
are reduced.  KSwapd is scanning more due to skipping dirty pages which is
unfortunate but the CPU usage is still acceptable
In an earlier set of tests, I used blktrace and in almost all cases
throughput throughout the entire test was higher.  However, I ended up
discarding those results as recording blktrace data was too heavy for my
liking.
On a laptop, I plugged in a USB stick and ran a similar tests of tests
using it as backing storage.  A desktop environment was running and for
the entire duration of the tests, firefox and gnome terminal were
launching and exiting to vaguely simulate a user.
1024M-xfs            Files/s  mean               0.41 ( 0.00%)        0.44 ( 6.82%)
1024M-xfs            Elapsed Time fsmark               2053.52   1641.03
1024M-xfs            Elapsed Time simple-wb            1229.53    768.05
1024M-xfs            Elapsed Time mmap-strm            4126.44   4597.03
1024M-xfs            Kswapd efficiency fsmark              84%       85%
1024M-xfs            Kswapd efficiency simple-wb           92%       81%
1024M-xfs            Kswapd efficiency mmap-strm           60%       51%
1024M-xfs            Avg wait ms fsmark                5404.53     4473.87
1024M-xfs            Avg wait ms simple-wb             2541.35     1453.54
1024M-xfs            Avg wait ms mmap-strm             3400.25     3852.53
The mmap-strm results were hurt because firefox launching had a tendency
to push the test out of memory.  On the postive side, firefox launched
marginally faster with the patches applied.  Time to completion for many
tests was faster but more importantly - the "Avg wait" time as measured by
iostat was far lower implying the system would be more responsive.  It was
also the case that "Avg wait ms" on the root filesystem was lower.  I
tested it manually and while the system felt slightly more responsive
while copying data to a USB stick, it was marginal enough that it could be
my imagination.
This patch: do not writeback filesystem pages in direct reclaim.
When kswapd is failing to keep zones above the min watermark, a process
will enter direct reclaim in the same manner kswapd does.  If a dirty page
is encountered during the scan, this page is written to backing storage
using mapping->writepage.
This causes two problems.  First, it can result in very deep call stacks,
particularly if the target storage or filesystem are complex.  Some
filesystems ignore write requests from direct reclaim as a result.  The
second is that a single-page flush is inefficient in terms of IO.  While
there is an expectation that the elevator will merge requests, this does
not always happen.  Quoting Christoph Hellwig;
	The elevator has a relatively small window it can operate on,
	and can never fix up a bad large scale writeback pattern.
This patch prevents direct reclaim writing back filesystem pages by
checking if current is kswapd.  Anonymous pages are still written to swap
as there is not the equivalent of a flusher thread for anonymous pages.
If the dirty pages cannot be written back, they are placed back on the LRU
lists.  There is now a direct dependency on dirty page balancing to
prevent too many pages in the system being dirtied which would prevent
reclaim making forward progress.
Signed-off-by: Mel Gorman <mgorman@suse.de>
Reviewed-by: Minchan Kim <minchan.kim@gmail.com>
Cc: Dave Chinner <david@fromorbit.com>
Cc: Christoph Hellwig <hch@infradead.org>
Cc: Johannes Weiner <jweiner@redhat.com>
Cc: Wu Fengguang <fengguang.wu@intel.com>
Cc: Jan Kara <jack@suse.cz>
Cc: Rik van Riel <riel@redhat.com>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Alex Elder <aelder@sgi.com>
Cc: Theodore Ts'o <tytso@mit.edu>
Cc: Chris Mason <chris.mason@oracle.com>
Cc: Dave Hansen <dave@linux.vnet.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
											
										 
											2011-10-31 17:07:38 -07:00
										 |  |  | 			 */ | 
					
						
							| 
									
										
										
										
											2011-10-31 17:07:51 -07:00
										 |  |  | 			if (page_is_file_cache(page) && | 
					
						
							| 
									
										
										
										
											2012-05-29 15:06:57 -07:00
										 |  |  | 					(!current_is_kswapd() || | 
					
						
							| 
									
										
										
										
											2013-07-03 15:01:50 -07:00
										 |  |  | 					 !zone_is_reclaim_dirty(zone))) { | 
					
						
							| 
									
										
										
										
											2011-10-31 17:07:59 -07:00
										 |  |  | 				/*
 | 
					
						
							|  |  |  | 				 * Immediately reclaim when written back. | 
					
						
							|  |  |  | 				 * Similar in principal to deactivate_page() | 
					
						
							|  |  |  | 				 * except we already have the page isolated | 
					
						
							|  |  |  | 				 * and know it's dirty | 
					
						
							|  |  |  | 				 */ | 
					
						
							|  |  |  | 				inc_zone_page_state(page, NR_VMSCAN_IMMEDIATE); | 
					
						
							|  |  |  | 				SetPageReclaim(page); | 
					
						
							|  |  |  | 
 | 
					
						
							| 
									
										
											  
											
												mm: vmscan: do not writeback filesystem pages in direct reclaim
Testing from the XFS folk revealed that there is still too much I/O from
the end of the LRU in kswapd.  Previously it was considered acceptable by
VM people for a small number of pages to be written back from reclaim with
testing generally showing about 0.3% of pages reclaimed were written back
(higher if memory was low).  That writing back a small number of pages is
ok has been heavily disputed for quite some time and Dave Chinner
explained it well;
	It doesn't have to be a very high number to be a problem. IO
	is orders of magnitude slower than the CPU time it takes to
	flush a page, so the cost of making a bad flush decision is
	very high. And single page writeback from the LRU is almost
	always a bad flush decision.
To complicate matters, filesystems respond very differently to requests
from reclaim according to Christoph Hellwig;
	xfs tries to write it back if the requester is kswapd
	ext4 ignores the request if it's a delayed allocation
	btrfs ignores the request
As a result, each filesystem has different performance characteristics
when under memory pressure and there are many pages being dirtied.  In
some cases, the request is ignored entirely so the VM cannot depend on the
IO being dispatched.
The objective of this series is to reduce writing of filesystem-backed
pages from reclaim, play nicely with writeback that is already in progress
and throttle reclaim appropriately when writeback pages are encountered.
The assumption is that the flushers will always write pages faster than if
reclaim issues the IO.
A secondary goal is to avoid the problem whereby direct reclaim splices
two potentially deep call stacks together.
There is a potential new problem as reclaim has less control over how long
before a page in a particularly zone or container is cleaned and direct
reclaimers depend on kswapd or flusher threads to do the necessary work.
However, as filesystems sometimes ignore direct reclaim requests already,
it is not expected to be a serious issue.
Patch 1 disables writeback of filesystem pages from direct reclaim
	entirely. Anonymous pages are still written.
Patch 2 removes dead code in lumpy reclaim as it is no longer able
	to synchronously write pages. This hurts lumpy reclaim but
	there is an expectation that compaction is used for hugepage
	allocations these days and lumpy reclaim's days are numbered.
Patches 3-4 add warnings to XFS and ext4 if called from
	direct reclaim. With patch 1, this "never happens" and is
	intended to catch regressions in this logic in the future.
Patch 5 disables writeback of filesystem pages from kswapd unless
	the priority is raised to the point where kswapd is considered
	to be in trouble.
Patch 6 throttles reclaimers if too many dirty pages are being
	encountered and the zones or backing devices are congested.
Patch 7 invalidates dirty pages found at the end of the LRU so they
	are reclaimed quickly after being written back rather than
	waiting for a reclaimer to find them
I consider this series to be orthogonal to the writeback work but it is
worth noting that the writeback work affects the viability of patch 8 in
particular.
I tested this on ext4 and xfs using fs_mark, a simple writeback test based
on dd and a micro benchmark that does a streaming write to a large mapping
(exercises use-once LRU logic) followed by streaming writes to a mix of
anonymous and file-backed mappings.  The command line for fs_mark when
botted with 512M looked something like
./fs_mark -d  /tmp/fsmark-2676  -D  100  -N  150  -n  150  -L  25  -t  1  -S0  -s  10485760
The number of files was adjusted depending on the amount of available
memory so that the files created was about 3xRAM.  For multiple threads,
the -d switch is specified multiple times.
The test machine is x86-64 with an older generation of AMD processor with
4 cores.  The underlying storage was 4 disks configured as RAID-0 as this
was the best configuration of storage I had available.  Swap is on a
separate disk.  Dirty ratio was tuned to 40% instead of the default of
20%.
Testing was run with and without monitors to both verify that the patches
were operating as expected and that any performance gain was real and not
due to interference from monitors.
Here is a summary of results based on testing XFS.
512M1P-xfs           Files/s  mean                 32.69 ( 0.00%)     34.44 ( 5.08%)
512M1P-xfs           Elapsed Time fsmark                    51.41     48.29
512M1P-xfs           Elapsed Time simple-wb                114.09    108.61
512M1P-xfs           Elapsed Time mmap-strm                113.46    109.34
512M1P-xfs           Kswapd efficiency fsmark                 62%       63%
512M1P-xfs           Kswapd efficiency simple-wb              56%       61%
512M1P-xfs           Kswapd efficiency mmap-strm              44%       42%
512M-xfs             Files/s  mean                 30.78 ( 0.00%)     35.94 (14.36%)
512M-xfs             Elapsed Time fsmark                    56.08     48.90
512M-xfs             Elapsed Time simple-wb                112.22     98.13
512M-xfs             Elapsed Time mmap-strm                219.15    196.67
512M-xfs             Kswapd efficiency fsmark                 54%       56%
512M-xfs             Kswapd efficiency simple-wb              54%       55%
512M-xfs             Kswapd efficiency mmap-strm              45%       44%
512M-4X-xfs          Files/s  mean                 30.31 ( 0.00%)     33.33 ( 9.06%)
512M-4X-xfs          Elapsed Time fsmark                    63.26     55.88
512M-4X-xfs          Elapsed Time simple-wb                100.90     90.25
512M-4X-xfs          Elapsed Time mmap-strm                261.73    255.38
512M-4X-xfs          Kswapd efficiency fsmark                 49%       50%
512M-4X-xfs          Kswapd efficiency simple-wb              54%       56%
512M-4X-xfs          Kswapd efficiency mmap-strm              37%       36%
512M-16X-xfs         Files/s  mean                 60.89 ( 0.00%)     65.22 ( 6.64%)
512M-16X-xfs         Elapsed Time fsmark                    67.47     58.25
512M-16X-xfs         Elapsed Time simple-wb                103.22     90.89
512M-16X-xfs         Elapsed Time mmap-strm                237.09    198.82
512M-16X-xfs         Kswapd efficiency fsmark                 45%       46%
512M-16X-xfs         Kswapd efficiency simple-wb              53%       55%
512M-16X-xfs         Kswapd efficiency mmap-strm              33%       33%
Up until 512-4X, the FSmark improvements were statistically significant.
For the 4X and 16X tests the results were within standard deviations but
just barely.  The time to completion for all tests is improved which is an
important result.  In general, kswapd efficiency is not affected by
skipping dirty pages.
1024M1P-xfs          Files/s  mean                 39.09 ( 0.00%)     41.15 ( 5.01%)
1024M1P-xfs          Elapsed Time fsmark                    84.14     80.41
1024M1P-xfs          Elapsed Time simple-wb                210.77    184.78
1024M1P-xfs          Elapsed Time mmap-strm                162.00    160.34
1024M1P-xfs          Kswapd efficiency fsmark                 69%       75%
1024M1P-xfs          Kswapd efficiency simple-wb              71%       77%
1024M1P-xfs          Kswapd efficiency mmap-strm              43%       44%
1024M-xfs            Files/s  mean                 35.45 ( 0.00%)     37.00 ( 4.19%)
1024M-xfs            Elapsed Time fsmark                    94.59     91.00
1024M-xfs            Elapsed Time simple-wb                229.84    195.08
1024M-xfs            Elapsed Time mmap-strm                405.38    440.29
1024M-xfs            Kswapd efficiency fsmark                 79%       71%
1024M-xfs            Kswapd efficiency simple-wb              74%       74%
1024M-xfs            Kswapd efficiency mmap-strm              39%       42%
1024M-4X-xfs         Files/s  mean                 32.63 ( 0.00%)     35.05 ( 6.90%)
1024M-4X-xfs         Elapsed Time fsmark                   103.33     97.74
1024M-4X-xfs         Elapsed Time simple-wb                204.48    178.57
1024M-4X-xfs         Elapsed Time mmap-strm                528.38    511.88
1024M-4X-xfs         Kswapd efficiency fsmark                 81%       70%
1024M-4X-xfs         Kswapd efficiency simple-wb              73%       72%
1024M-4X-xfs         Kswapd efficiency mmap-strm              39%       38%
1024M-16X-xfs        Files/s  mean                 42.65 ( 0.00%)     42.97 ( 0.74%)
1024M-16X-xfs        Elapsed Time fsmark                   103.11     99.11
1024M-16X-xfs        Elapsed Time simple-wb                200.83    178.24
1024M-16X-xfs        Elapsed Time mmap-strm                397.35    459.82
1024M-16X-xfs        Kswapd efficiency fsmark                 84%       69%
1024M-16X-xfs        Kswapd efficiency simple-wb              74%       73%
1024M-16X-xfs        Kswapd efficiency mmap-strm              39%       40%
All FSMark tests up to 16X had statistically significant improvements.
For the most part, tests are completing faster with the exception of the
streaming writes to a mixture of anonymous and file-backed mappings which
were slower in two cases
In the cases where the mmap-strm tests were slower, there was more
swapping due to dirty pages being skipped.  The number of additional pages
swapped is almost identical to the fewer number of pages written from
reclaim.  In other words, roughly the same number of pages were reclaimed
but swapping was slower.  As the test is a bit unrealistic and stresses
memory heavily, the small shift is acceptable.
4608M1P-xfs          Files/s  mean                 29.75 ( 0.00%)     30.96 ( 3.91%)
4608M1P-xfs          Elapsed Time fsmark                   512.01    492.15
4608M1P-xfs          Elapsed Time simple-wb                618.18    566.24
4608M1P-xfs          Elapsed Time mmap-strm                488.05    465.07
4608M1P-xfs          Kswapd efficiency fsmark                 93%       86%
4608M1P-xfs          Kswapd efficiency simple-wb              88%       84%
4608M1P-xfs          Kswapd efficiency mmap-strm              46%       45%
4608M-xfs            Files/s  mean                 27.60 ( 0.00%)     28.85 ( 4.33%)
4608M-xfs            Elapsed Time fsmark                   555.96    532.34
4608M-xfs            Elapsed Time simple-wb                659.72    571.85
4608M-xfs            Elapsed Time mmap-strm               1082.57   1146.38
4608M-xfs            Kswapd efficiency fsmark                 89%       91%
4608M-xfs            Kswapd efficiency simple-wb              88%       82%
4608M-xfs            Kswapd efficiency mmap-strm              48%       46%
4608M-4X-xfs         Files/s  mean                 26.00 ( 0.00%)     27.47 ( 5.35%)
4608M-4X-xfs         Elapsed Time fsmark                   592.91    564.00
4608M-4X-xfs         Elapsed Time simple-wb                616.65    575.07
4608M-4X-xfs         Elapsed Time mmap-strm               1773.02   1631.53
4608M-4X-xfs         Kswapd efficiency fsmark                 90%       94%
4608M-4X-xfs         Kswapd efficiency simple-wb              87%       82%
4608M-4X-xfs         Kswapd efficiency mmap-strm              43%       43%
4608M-16X-xfs        Files/s  mean                 26.07 ( 0.00%)     26.42 ( 1.32%)
4608M-16X-xfs        Elapsed Time fsmark                   602.69    585.78
4608M-16X-xfs        Elapsed Time simple-wb                606.60    573.81
4608M-16X-xfs        Elapsed Time mmap-strm               1549.75   1441.86
4608M-16X-xfs        Kswapd efficiency fsmark                 98%       98%
4608M-16X-xfs        Kswapd efficiency simple-wb              88%       82%
4608M-16X-xfs        Kswapd efficiency mmap-strm              44%       42%
Unlike the other tests, the fsmark results are not statistically
significant but the min and max times are both improved and for the most
part, tests completed faster.
There are other indications that this is an improvement as well.  For
example, in the vast majority of cases, there were fewer pages scanned by
direct reclaim implying in many cases that stalls due to direct reclaim
are reduced.  KSwapd is scanning more due to skipping dirty pages which is
unfortunate but the CPU usage is still acceptable
In an earlier set of tests, I used blktrace and in almost all cases
throughput throughout the entire test was higher.  However, I ended up
discarding those results as recording blktrace data was too heavy for my
liking.
On a laptop, I plugged in a USB stick and ran a similar tests of tests
using it as backing storage.  A desktop environment was running and for
the entire duration of the tests, firefox and gnome terminal were
launching and exiting to vaguely simulate a user.
1024M-xfs            Files/s  mean               0.41 ( 0.00%)        0.44 ( 6.82%)
1024M-xfs            Elapsed Time fsmark               2053.52   1641.03
1024M-xfs            Elapsed Time simple-wb            1229.53    768.05
1024M-xfs            Elapsed Time mmap-strm            4126.44   4597.03
1024M-xfs            Kswapd efficiency fsmark              84%       85%
1024M-xfs            Kswapd efficiency simple-wb           92%       81%
1024M-xfs            Kswapd efficiency mmap-strm           60%       51%
1024M-xfs            Avg wait ms fsmark                5404.53     4473.87
1024M-xfs            Avg wait ms simple-wb             2541.35     1453.54
1024M-xfs            Avg wait ms mmap-strm             3400.25     3852.53
The mmap-strm results were hurt because firefox launching had a tendency
to push the test out of memory.  On the postive side, firefox launched
marginally faster with the patches applied.  Time to completion for many
tests was faster but more importantly - the "Avg wait" time as measured by
iostat was far lower implying the system would be more responsive.  It was
also the case that "Avg wait ms" on the root filesystem was lower.  I
tested it manually and while the system felt slightly more responsive
while copying data to a USB stick, it was marginal enough that it could be
my imagination.
This patch: do not writeback filesystem pages in direct reclaim.
When kswapd is failing to keep zones above the min watermark, a process
will enter direct reclaim in the same manner kswapd does.  If a dirty page
is encountered during the scan, this page is written to backing storage
using mapping->writepage.
This causes two problems.  First, it can result in very deep call stacks,
particularly if the target storage or filesystem are complex.  Some
filesystems ignore write requests from direct reclaim as a result.  The
second is that a single-page flush is inefficient in terms of IO.  While
there is an expectation that the elevator will merge requests, this does
not always happen.  Quoting Christoph Hellwig;
	The elevator has a relatively small window it can operate on,
	and can never fix up a bad large scale writeback pattern.
This patch prevents direct reclaim writing back filesystem pages by
checking if current is kswapd.  Anonymous pages are still written to swap
as there is not the equivalent of a flusher thread for anonymous pages.
If the dirty pages cannot be written back, they are placed back on the LRU
lists.  There is now a direct dependency on dirty page balancing to
prevent too many pages in the system being dirtied which would prevent
reclaim making forward progress.
Signed-off-by: Mel Gorman <mgorman@suse.de>
Reviewed-by: Minchan Kim <minchan.kim@gmail.com>
Cc: Dave Chinner <david@fromorbit.com>
Cc: Christoph Hellwig <hch@infradead.org>
Cc: Johannes Weiner <jweiner@redhat.com>
Cc: Wu Fengguang <fengguang.wu@intel.com>
Cc: Jan Kara <jack@suse.cz>
Cc: Rik van Riel <riel@redhat.com>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Alex Elder <aelder@sgi.com>
Cc: Theodore Ts'o <tytso@mit.edu>
Cc: Chris Mason <chris.mason@oracle.com>
Cc: Dave Hansen <dave@linux.vnet.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
											
										 
											2011-10-31 17:07:38 -07:00
										 |  |  | 				goto keep_locked; | 
					
						
							|  |  |  | 			} | 
					
						
							|  |  |  | 
 | 
					
						
							| 
									
										
											  
											
												vmscan: factor out page reference checks
The used-once mapped file page detection patchset.
It is meant to help workloads with large amounts of shortly used file
mappings, like rtorrent hashing a file or git when dealing with loose
objects (git gc on a bigger site?).
Right now, the VM activates referenced mapped file pages on first
encounter on the inactive list and it takes a full memory cycle to
reclaim them again.  When those pages dominate memory, the system
no longer has a meaningful notion of 'working set' and is required
to give up the active list to make reclaim progress.  Obviously,
this results in rather bad scanning latencies and the wrong pages
being reclaimed.
This patch makes the VM be more careful about activating mapped file
pages in the first place.  The minimum granted lifetime without
another memory access becomes an inactive list cycle instead of the
full memory cycle, which is more natural given the mentioned loads.
This test resembles a hashing rtorrent process.  Sequentially, 32MB
chunks of a file are mapped into memory, hashed (sha1) and unmapped
again.  While this happens, every 5 seconds a process is launched and
its execution time taken:
	python2.4 -c 'import pydoc'
	old: max=2.31s mean=1.26s (0.34)
	new: max=1.25s mean=0.32s (0.32)
	find /etc -type f
	old: max=2.52s mean=1.44s (0.43)
	new: max=1.92s mean=0.12s (0.17)
	vim -c ':quit'
	old: max=6.14s mean=4.03s (0.49)
	new: max=3.48s mean=2.41s (0.25)
	mplayer --help
	old: max=8.08s mean=5.74s (1.02)
	new: max=3.79s mean=1.32s (0.81)
	overall hash time (stdev):
	old: time=1192.30 (12.85) thruput=25.78mb/s (0.27)
	new: time=1060.27 (32.58) thruput=29.02mb/s (0.88) (-11%)
I also tested kernbench with regular IO streaming in the background to
see whether the delayed activation of frequently used mapped file
pages had a negative impact on performance in the presence of pressure
on the inactive list.  The patch made no significant difference in
timing, neither for kernbench nor for the streaming IO throughput.
The first patch submission raised concerns about the cost of the extra
faults for actually activated pages on machines that have no hardware
support for young page table entries.
I created an artificial worst case scenario on an ARM machine with
around 300MHz and 64MB of memory to figure out the dimensions
involved.  The test would mmap a file of 20MB, then
  1. touch all its pages to fault them in
  2. force one full scan cycle on the inactive file LRU
  -- old: mapping pages activated
  -- new: mapping pages inactive
  3. touch the mapping pages again
  -- old and new: fault exceptions to set the young bits
  4. force another full scan cycle on the inactive file LRU
  5. touch the mapping pages one last time
  -- new: fault exceptions to set the young bits
The test showed an overall increase of 6% in time over 100 iterations
of the above (old: ~212sec, new: ~225sec).  13 secs total overhead /
(100 * 5k pages), ignoring the execution time of the test itself,
makes for about 25us overhead for every page that gets actually
activated.  Note:
  1. File mapping the size of one third of main memory, _completely_
  in active use across memory pressure - i.e., most pages referenced
  within one LRU cycle.  This should be rare to non-existant,
  especially on such embedded setups.
  2. Many huge activation batches.  Those batches only occur when the
  working set fluctuates.  If it changes completely between every full
  LRU cycle, you have problematic reclaim overhead anyway.
  3. Access of activated pages at maximum speed: sequential loads from
  every single page without doing anything in between.  In reality,
  the extra faults will get distributed between actual operations on
  the data.
So even if a workload manages to get the VM into the situation of
activating a third of memory in one go on such a setup, it will take
2.2 seconds instead 2.1 without the patch.
Comparing the numbers (and my user-experience over several months),
I think this change is an overall improvement to the VM.
Patch 1 is only refactoring to break up that ugly compound conditional
in shrink_page_list() and make it easy to document and add new checks
in a readable fashion.
Patch 2 gets rid of the obsolete page_mapping_inuse().  It's not
strictly related to #3, but it was in the original submission and is a
net simplification, so I kept it.
Patch 3 implements used-once detection of mapped file pages.
This patch:
Moving the big conditional into its own predicate function makes the code
a bit easier to read and allows for better commenting on the checks
one-by-one.
This is just cleaning up, no semantics should have been changed.
Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
Reviewed-by: Rik van Riel <riel@redhat.com>
Cc: Minchan Kim <minchan.kim@gmail.com>
Cc: OSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Cc: Lee Schermerhorn <lee.schermerhorn@hp.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
											
										 
											2010-03-05 13:42:19 -08:00
										 |  |  | 			if (references == PAGEREF_RECLAIM_CLEAN) | 
					
						
							| 
									
										
										
										
											2005-04-16 15:20:36 -07:00
										 |  |  | 				goto keep_locked; | 
					
						
							| 
									
										
										
										
											2008-03-24 12:29:52 -07:00
										 |  |  | 			if (!may_enter_fs) | 
					
						
							| 
									
										
										
										
											2005-04-16 15:20:36 -07:00
										 |  |  | 				goto keep_locked; | 
					
						
							| 
									
										
										
										
											2006-02-01 03:05:28 -08:00
										 |  |  | 			if (!sc->may_writepage) | 
					
						
							| 
									
										
										
										
											2005-04-16 15:20:36 -07:00
										 |  |  | 				goto keep_locked; | 
					
						
							|  |  |  | 
 | 
					
						
							|  |  |  | 			/* Page is dirty, try to write it out here */ | 
					
						
							| 
									
										
											  
											
												vmscan: narrow the scenarios in whcih lumpy reclaim uses synchrounous reclaim
shrink_page_list() can decide to give up reclaiming a page under a
number of conditions such as
  1. trylock_page() failure
  2. page is unevictable
  3. zone reclaim and page is mapped
  4. PageWriteback() is true
  5. page is swapbacked and swap is full
  6. add_to_swap() failure
  7. page is dirty and gfpmask don't have GFP_IO, GFP_FS
  8. page is pinned
  9. IO queue is congested
 10. pageout() start IO, but not finished
With lumpy reclaim, failures result in entering synchronous lumpy reclaim
but this can be unnecessary.  In cases (2), (3), (5), (6), (7) and (8),
there is no point retrying.  This patch causes lumpy reclaim to abort when
it is known it will fail.
Case (9) is more interesting. current behavior is,
  1. start shrink_page_list(async)
  2. found queue_congested()
  3. skip pageout write
  4. still start shrink_page_list(sync)
  5. wait on a lot of pages
  6. again, found queue_congested()
  7. give up pageout write again
So, it's useless time wasting.  However, just skipping page reclaim is
also notgood as x86 allocating a huge page needs 512 pages for example.
It can have more dirty pages than queue congestion threshold (~=128).
After this patch, pageout() behaves as follows;
 - If order > PAGE_ALLOC_COSTLY_ORDER
	Ignore queue congestion always.
 - If order <= PAGE_ALLOC_COSTLY_ORDER
	skip write page and disable lumpy reclaim.
Signed-off-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Signed-off-by: Mel Gorman <mel@csn.ul.ie>
Reviewed-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Minchan Kim <minchan.kim@gmail.com>
Cc: Wu Fengguang <fengguang.wu@intel.com>
Cc: Rik van Riel <riel@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
											
										 
											2010-10-26 14:21:42 -07:00
										 |  |  | 			switch (pageout(page, mapping, sc)) { | 
					
						
							| 
									
										
										
										
											2005-04-16 15:20:36 -07:00
										 |  |  | 			case PAGE_KEEP: | 
					
						
							|  |  |  | 				goto keep_locked; | 
					
						
							|  |  |  | 			case PAGE_ACTIVATE: | 
					
						
							|  |  |  | 				goto activate_locked; | 
					
						
							|  |  |  | 			case PAGE_SUCCESS: | 
					
						
							| 
									
										
											  
											
												vmscan: narrow the scenarios in whcih lumpy reclaim uses synchrounous reclaim
shrink_page_list() can decide to give up reclaiming a page under a
number of conditions such as
  1. trylock_page() failure
  2. page is unevictable
  3. zone reclaim and page is mapped
  4. PageWriteback() is true
  5. page is swapbacked and swap is full
  6. add_to_swap() failure
  7. page is dirty and gfpmask don't have GFP_IO, GFP_FS
  8. page is pinned
  9. IO queue is congested
 10. pageout() start IO, but not finished
With lumpy reclaim, failures result in entering synchronous lumpy reclaim
but this can be unnecessary.  In cases (2), (3), (5), (6), (7) and (8),
there is no point retrying.  This patch causes lumpy reclaim to abort when
it is known it will fail.
Case (9) is more interesting. current behavior is,
  1. start shrink_page_list(async)
  2. found queue_congested()
  3. skip pageout write
  4. still start shrink_page_list(sync)
  5. wait on a lot of pages
  6. again, found queue_congested()
  7. give up pageout write again
So, it's useless time wasting.  However, just skipping page reclaim is
also notgood as x86 allocating a huge page needs 512 pages for example.
It can have more dirty pages than queue congestion threshold (~=128).
After this patch, pageout() behaves as follows;
 - If order > PAGE_ALLOC_COSTLY_ORDER
	Ignore queue congestion always.
 - If order <= PAGE_ALLOC_COSTLY_ORDER
	skip write page and disable lumpy reclaim.
Signed-off-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Signed-off-by: Mel Gorman <mel@csn.ul.ie>
Reviewed-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Minchan Kim <minchan.kim@gmail.com>
Cc: Wu Fengguang <fengguang.wu@intel.com>
Cc: Rik van Riel <riel@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
											
										 
											2010-10-26 14:21:42 -07:00
										 |  |  | 				if (PageWriteback(page)) | 
					
						
							| 
									
										
										
										
											2012-05-29 15:06:19 -07:00
										 |  |  | 					goto keep; | 
					
						
							| 
									
										
											  
											
												vmscan: narrow the scenarios in whcih lumpy reclaim uses synchrounous reclaim
shrink_page_list() can decide to give up reclaiming a page under a
number of conditions such as
  1. trylock_page() failure
  2. page is unevictable
  3. zone reclaim and page is mapped
  4. PageWriteback() is true
  5. page is swapbacked and swap is full
  6. add_to_swap() failure
  7. page is dirty and gfpmask don't have GFP_IO, GFP_FS
  8. page is pinned
  9. IO queue is congested
 10. pageout() start IO, but not finished
With lumpy reclaim, failures result in entering synchronous lumpy reclaim
but this can be unnecessary.  In cases (2), (3), (5), (6), (7) and (8),
there is no point retrying.  This patch causes lumpy reclaim to abort when
it is known it will fail.
Case (9) is more interesting. current behavior is,
  1. start shrink_page_list(async)
  2. found queue_congested()
  3. skip pageout write
  4. still start shrink_page_list(sync)
  5. wait on a lot of pages
  6. again, found queue_congested()
  7. give up pageout write again
So, it's useless time wasting.  However, just skipping page reclaim is
also notgood as x86 allocating a huge page needs 512 pages for example.
It can have more dirty pages than queue congestion threshold (~=128).
After this patch, pageout() behaves as follows;
 - If order > PAGE_ALLOC_COSTLY_ORDER
	Ignore queue congestion always.
 - If order <= PAGE_ALLOC_COSTLY_ORDER
	skip write page and disable lumpy reclaim.
Signed-off-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Signed-off-by: Mel Gorman <mel@csn.ul.ie>
Reviewed-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Minchan Kim <minchan.kim@gmail.com>
Cc: Wu Fengguang <fengguang.wu@intel.com>
Cc: Rik van Riel <riel@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
											
										 
											2010-10-26 14:21:42 -07:00
										 |  |  | 				if (PageDirty(page)) | 
					
						
							| 
									
										
										
										
											2005-04-16 15:20:36 -07:00
										 |  |  | 					goto keep; | 
					
						
							| 
									
										
											  
											
												vmscan: narrow the scenarios in whcih lumpy reclaim uses synchrounous reclaim
shrink_page_list() can decide to give up reclaiming a page under a
number of conditions such as
  1. trylock_page() failure
  2. page is unevictable
  3. zone reclaim and page is mapped
  4. PageWriteback() is true
  5. page is swapbacked and swap is full
  6. add_to_swap() failure
  7. page is dirty and gfpmask don't have GFP_IO, GFP_FS
  8. page is pinned
  9. IO queue is congested
 10. pageout() start IO, but not finished
With lumpy reclaim, failures result in entering synchronous lumpy reclaim
but this can be unnecessary.  In cases (2), (3), (5), (6), (7) and (8),
there is no point retrying.  This patch causes lumpy reclaim to abort when
it is known it will fail.
Case (9) is more interesting. current behavior is,
  1. start shrink_page_list(async)
  2. found queue_congested()
  3. skip pageout write
  4. still start shrink_page_list(sync)
  5. wait on a lot of pages
  6. again, found queue_congested()
  7. give up pageout write again
So, it's useless time wasting.  However, just skipping page reclaim is
also notgood as x86 allocating a huge page needs 512 pages for example.
It can have more dirty pages than queue congestion threshold (~=128).
After this patch, pageout() behaves as follows;
 - If order > PAGE_ALLOC_COSTLY_ORDER
	Ignore queue congestion always.
 - If order <= PAGE_ALLOC_COSTLY_ORDER
	skip write page and disable lumpy reclaim.
Signed-off-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Signed-off-by: Mel Gorman <mel@csn.ul.ie>
Reviewed-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Minchan Kim <minchan.kim@gmail.com>
Cc: Wu Fengguang <fengguang.wu@intel.com>
Cc: Rik van Riel <riel@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
											
										 
											2010-10-26 14:21:42 -07:00
										 |  |  | 
 | 
					
						
							| 
									
										
										
										
											2005-04-16 15:20:36 -07:00
										 |  |  | 				/*
 | 
					
						
							|  |  |  | 				 * A synchronous write - probably a ramdisk.  Go | 
					
						
							|  |  |  | 				 * ahead and try to reclaim the page. | 
					
						
							|  |  |  | 				 */ | 
					
						
							| 
									
										
										
										
											2008-08-02 12:01:03 +02:00
										 |  |  | 				if (!trylock_page(page)) | 
					
						
							| 
									
										
										
										
											2005-04-16 15:20:36 -07:00
										 |  |  | 					goto keep; | 
					
						
							|  |  |  | 				if (PageDirty(page) || PageWriteback(page)) | 
					
						
							|  |  |  | 					goto keep_locked; | 
					
						
							|  |  |  | 				mapping = page_mapping(page); | 
					
						
							|  |  |  | 			case PAGE_CLEAN: | 
					
						
							|  |  |  | 				; /* try to free the page below */ | 
					
						
							|  |  |  | 			} | 
					
						
							|  |  |  | 		} | 
					
						
							|  |  |  | 
 | 
					
						
							|  |  |  | 		/*
 | 
					
						
							|  |  |  | 		 * If the page has buffers, try to free the buffer mappings | 
					
						
							|  |  |  | 		 * associated with this page. If we succeed we try to free | 
					
						
							|  |  |  | 		 * the page as well. | 
					
						
							|  |  |  | 		 * | 
					
						
							|  |  |  | 		 * We do this even if the page is PageDirty(). | 
					
						
							|  |  |  | 		 * try_to_release_page() does not perform I/O, but it is | 
					
						
							|  |  |  | 		 * possible for a page to have PageDirty set, but it is actually | 
					
						
							|  |  |  | 		 * clean (all its buffers are clean).  This happens if the | 
					
						
							|  |  |  | 		 * buffers were written out directly, with submit_bh(). ext3 | 
					
						
							| 
									
										
											  
											
												Unevictable LRU Infrastructure
When the system contains lots of mlocked or otherwise unevictable pages,
the pageout code (kswapd) can spend lots of time scanning over these
pages.  Worse still, the presence of lots of unevictable pages can confuse
kswapd into thinking that more aggressive pageout modes are required,
resulting in all kinds of bad behaviour.
Infrastructure to manage pages excluded from reclaim--i.e., hidden from
vmscan.  Based on a patch by Larry Woodman of Red Hat.  Reworked to
maintain "unevictable" pages on a separate per-zone LRU list, to "hide"
them from vmscan.
Kosaki Motohiro added the support for the memory controller unevictable
lru list.
Pages on the unevictable list have both PG_unevictable and PG_lru set.
Thus, PG_unevictable is analogous to and mutually exclusive with
PG_active--it specifies which LRU list the page is on.
The unevictable infrastructure is enabled by a new mm Kconfig option
[CONFIG_]UNEVICTABLE_LRU.
A new function 'page_evictable(page, vma)' in vmscan.c tests whether or
not a page may be evictable.  Subsequent patches will add the various
!evictable tests.  We'll want to keep these tests light-weight for use in
shrink_active_list() and, possibly, the fault path.
To avoid races between tasks putting pages [back] onto an LRU list and
tasks that might be moving the page from non-evictable to evictable state,
the new function 'putback_lru_page()' -- inverse to 'isolate_lru_page()'
-- tests the "evictability" of a page after placing it on the LRU, before
dropping the reference.  If the page has become unevictable,
putback_lru_page() will redo the 'putback', thus moving the page to the
unevictable list.  This way, we avoid "stranding" evictable pages on the
unevictable list.
[akpm@linux-foundation.org: fix fallout from out-of-order merge]
[riel@redhat.com: fix UNEVICTABLE_LRU and !PROC_PAGE_MONITOR build]
[nishimura@mxp.nes.nec.co.jp: remove redundant mapping check]
[kosaki.motohiro@jp.fujitsu.com: unevictable-lru-infrastructure: putback_lru_page()/unevictable page handling rework]
[kosaki.motohiro@jp.fujitsu.com: kill unnecessary lock_page() in vmscan.c]
[kosaki.motohiro@jp.fujitsu.com: revert migration change of unevictable lru infrastructure]
[kosaki.motohiro@jp.fujitsu.com: revert to unevictable-lru-infrastructure-kconfig-fix.patch]
[kosaki.motohiro@jp.fujitsu.com: restore patch failure of vmstat-unevictable-and-mlocked-pages-vm-events.patch]
Signed-off-by: Lee Schermerhorn <lee.schermerhorn@hp.com>
Signed-off-by: Rik van Riel <riel@redhat.com>
Signed-off-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Debugged-by: Benjamin Kidwell <benjkidwell@yahoo.com>
Signed-off-by: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp>
Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
											
										 
											2008-10-18 20:26:39 -07:00
										 |  |  | 		 * will do this, as well as the blockdev mapping. | 
					
						
							| 
									
										
										
										
											2005-04-16 15:20:36 -07:00
										 |  |  | 		 * try_to_release_page() will discover that cleanness and will | 
					
						
							|  |  |  | 		 * drop the buffers and mark the page clean - it can be freed. | 
					
						
							|  |  |  | 		 * | 
					
						
							|  |  |  | 		 * Rarely, pages can have buffers and no ->mapping.  These are | 
					
						
							|  |  |  | 		 * the pages which were not successfully invalidated in | 
					
						
							|  |  |  | 		 * truncate_complete_page().  We try to drop those buffers here | 
					
						
							|  |  |  | 		 * and if that worked, and the page is no longer mapped into | 
					
						
							|  |  |  | 		 * process address space (page_count == 1) it can be freed. | 
					
						
							|  |  |  | 		 * Otherwise, leave the page on the LRU so it is swappable. | 
					
						
							|  |  |  | 		 */ | 
					
						
							| 
									
										
										
										
											2009-04-03 16:42:36 +01:00
										 |  |  | 		if (page_has_private(page)) { | 
					
						
							| 
									
										
										
										
											2005-04-16 15:20:36 -07:00
										 |  |  | 			if (!try_to_release_page(page, sc->gfp_mask)) | 
					
						
							|  |  |  | 				goto activate_locked; | 
					
						
							| 
									
										
										
										
											2008-07-25 19:45:30 -07:00
										 |  |  | 			if (!mapping && page_count(page) == 1) { | 
					
						
							|  |  |  | 				unlock_page(page); | 
					
						
							|  |  |  | 				if (put_page_testzero(page)) | 
					
						
							|  |  |  | 					goto free_it; | 
					
						
							|  |  |  | 				else { | 
					
						
							|  |  |  | 					/*
 | 
					
						
							|  |  |  | 					 * rare race with speculative reference. | 
					
						
							|  |  |  | 					 * the speculative reference will free | 
					
						
							|  |  |  | 					 * this page shortly, so we may | 
					
						
							|  |  |  | 					 * increment nr_reclaimed here (and | 
					
						
							|  |  |  | 					 * leave it off the LRU). | 
					
						
							|  |  |  | 					 */ | 
					
						
							|  |  |  | 					nr_reclaimed++; | 
					
						
							|  |  |  | 					continue; | 
					
						
							|  |  |  | 				} | 
					
						
							|  |  |  | 			} | 
					
						
							| 
									
										
										
										
											2005-04-16 15:20:36 -07:00
										 |  |  | 		} | 
					
						
							|  |  |  | 
 | 
					
						
							| 
									
										
										
										
											2008-07-25 19:45:30 -07:00
										 |  |  | 		if (!mapping || !__remove_mapping(mapping, page)) | 
					
						
							| 
									
										
										
										
											2006-01-08 01:00:48 -08:00
										 |  |  | 			goto keep_locked; | 
					
						
							| 
									
										
										
										
											2005-04-16 15:20:36 -07:00
										 |  |  | 
 | 
					
						
							| 
									
										
										
										
											2008-10-18 20:26:58 -07:00
										 |  |  | 		/*
 | 
					
						
							|  |  |  | 		 * At this point, we have no other references and there is | 
					
						
							|  |  |  | 		 * no way to pick any more up (removed from LRU, removed | 
					
						
							|  |  |  | 		 * from pagecache). Can use non-atomic bitops now (and | 
					
						
							|  |  |  | 		 * we obviously don't have to worry about waking up a process | 
					
						
							|  |  |  | 		 * waiting on the page lock, because there are no references. | 
					
						
							|  |  |  | 		 */ | 
					
						
							|  |  |  | 		__clear_page_locked(page); | 
					
						
							| 
									
										
										
										
											2008-07-25 19:45:30 -07:00
										 |  |  | free_it: | 
					
						
							| 
									
										
										
										
											2006-03-22 00:08:20 -08:00
										 |  |  | 		nr_reclaimed++; | 
					
						
							| 
									
										
										
										
											2010-08-09 17:19:31 -07:00
										 |  |  | 
 | 
					
						
							|  |  |  | 		/*
 | 
					
						
							|  |  |  | 		 * Is there need to periodically free_page_list? It would | 
					
						
							|  |  |  | 		 * appear not as the counts should be low | 
					
						
							|  |  |  | 		 */ | 
					
						
							|  |  |  | 		list_add(&page->lru, &free_pages); | 
					
						
							| 
									
										
										
										
											2005-04-16 15:20:36 -07:00
										 |  |  | 		continue; | 
					
						
							|  |  |  | 
 | 
					
						
							| 
									
										
											  
											
												mlock: mlocked pages are unevictable
Make sure that mlocked pages also live on the unevictable LRU, so kswapd
will not scan them over and over again.
This is achieved through various strategies:
1) add yet another page flag--PG_mlocked--to indicate that
   the page is locked for efficient testing in vmscan and,
   optionally, fault path.  This allows early culling of
   unevictable pages, preventing them from getting to
   page_referenced()/try_to_unmap().  Also allows separate
   accounting of mlock'd pages, as Nick's original patch
   did.
   Note:  Nick's original mlock patch used a PG_mlocked
   flag.  I had removed this in favor of the PG_unevictable
   flag + an mlock_count [new page struct member].  I
   restored the PG_mlocked flag to eliminate the new
   count field.
2) add the mlock/unevictable infrastructure to mm/mlock.c,
   with internal APIs in mm/internal.h.  This is a rework
   of Nick's original patch to these files, taking into
   account that mlocked pages are now kept on unevictable
   LRU list.
3) update vmscan.c:page_evictable() to check PageMlocked()
   and, if vma passed in, the vm_flags.  Note that the vma
   will only be passed in for new pages in the fault path;
   and then only if the "cull unevictable pages in fault
   path" patch is included.
4) add try_to_unlock() to rmap.c to walk a page's rmap and
   ClearPageMlocked() if no other vmas have it mlocked.
   Reuses as much of try_to_unmap() as possible.  This
   effectively replaces the use of one of the lru list links
   as an mlock count.  If this mechanism let's pages in mlocked
   vmas leak through w/o PG_mlocked set [I don't know that it
   does], we should catch them later in try_to_unmap().  One
   hopes this will be rare, as it will be relatively expensive.
Original mm/internal.h, mm/rmap.c and mm/mlock.c changes:
Signed-off-by: Nick Piggin <npiggin@suse.de>
splitlru: introduce __get_user_pages():
  New munlock processing need to GUP_FLAGS_IGNORE_VMA_PERMISSIONS.
  because current get_user_pages() can't grab PROT_NONE pages theresore it
  cause PROT_NONE pages can't munlock.
[akpm@linux-foundation.org: fix this for pagemap-pass-mm-into-pagewalkers.patch]
[akpm@linux-foundation.org: untangle patch interdependencies]
[akpm@linux-foundation.org: fix things after out-of-order merging]
[hugh@veritas.com: fix page-flags mess]
[lee.schermerhorn@hp.com: fix munlock page table walk - now requires 'mm']
[kosaki.motohiro@jp.fujitsu.com: build fix]
[kosaki.motohiro@jp.fujitsu.com: fix truncate race and sevaral comments]
[kosaki.motohiro@jp.fujitsu.com: splitlru: introduce __get_user_pages()]
Signed-off-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Signed-off-by: Rik van Riel <riel@redhat.com>
Signed-off-by: Lee Schermerhorn <lee.schermerhorn@hp.com>
Cc: Nick Piggin <npiggin@suse.de>
Cc: Dave Hansen <dave@linux.vnet.ibm.com>
Cc: Matt Mackall <mpm@selenic.com>
Signed-off-by: Hugh Dickins <hugh@veritas.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
											
										 
											2008-10-18 20:26:44 -07:00
										 |  |  | cull_mlocked: | 
					
						
							| 
									
										
										
										
											2009-01-06 14:39:38 -08:00
										 |  |  | 		if (PageSwapCache(page)) | 
					
						
							|  |  |  | 			try_to_free_swap(page); | 
					
						
							| 
									
										
											  
											
												mlock: mlocked pages are unevictable
Make sure that mlocked pages also live on the unevictable LRU, so kswapd
will not scan them over and over again.
This is achieved through various strategies:
1) add yet another page flag--PG_mlocked--to indicate that
   the page is locked for efficient testing in vmscan and,
   optionally, fault path.  This allows early culling of
   unevictable pages, preventing them from getting to
   page_referenced()/try_to_unmap().  Also allows separate
   accounting of mlock'd pages, as Nick's original patch
   did.
   Note:  Nick's original mlock patch used a PG_mlocked
   flag.  I had removed this in favor of the PG_unevictable
   flag + an mlock_count [new page struct member].  I
   restored the PG_mlocked flag to eliminate the new
   count field.
2) add the mlock/unevictable infrastructure to mm/mlock.c,
   with internal APIs in mm/internal.h.  This is a rework
   of Nick's original patch to these files, taking into
   account that mlocked pages are now kept on unevictable
   LRU list.
3) update vmscan.c:page_evictable() to check PageMlocked()
   and, if vma passed in, the vm_flags.  Note that the vma
   will only be passed in for new pages in the fault path;
   and then only if the "cull unevictable pages in fault
   path" patch is included.
4) add try_to_unlock() to rmap.c to walk a page's rmap and
   ClearPageMlocked() if no other vmas have it mlocked.
   Reuses as much of try_to_unmap() as possible.  This
   effectively replaces the use of one of the lru list links
   as an mlock count.  If this mechanism let's pages in mlocked
   vmas leak through w/o PG_mlocked set [I don't know that it
   does], we should catch them later in try_to_unmap().  One
   hopes this will be rare, as it will be relatively expensive.
Original mm/internal.h, mm/rmap.c and mm/mlock.c changes:
Signed-off-by: Nick Piggin <npiggin@suse.de>
splitlru: introduce __get_user_pages():
  New munlock processing need to GUP_FLAGS_IGNORE_VMA_PERMISSIONS.
  because current get_user_pages() can't grab PROT_NONE pages theresore it
  cause PROT_NONE pages can't munlock.
[akpm@linux-foundation.org: fix this for pagemap-pass-mm-into-pagewalkers.patch]
[akpm@linux-foundation.org: untangle patch interdependencies]
[akpm@linux-foundation.org: fix things after out-of-order merging]
[hugh@veritas.com: fix page-flags mess]
[lee.schermerhorn@hp.com: fix munlock page table walk - now requires 'mm']
[kosaki.motohiro@jp.fujitsu.com: build fix]
[kosaki.motohiro@jp.fujitsu.com: fix truncate race and sevaral comments]
[kosaki.motohiro@jp.fujitsu.com: splitlru: introduce __get_user_pages()]
Signed-off-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Signed-off-by: Rik van Riel <riel@redhat.com>
Signed-off-by: Lee Schermerhorn <lee.schermerhorn@hp.com>
Cc: Nick Piggin <npiggin@suse.de>
Cc: Dave Hansen <dave@linux.vnet.ibm.com>
Cc: Matt Mackall <mpm@selenic.com>
Signed-off-by: Hugh Dickins <hugh@veritas.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
											
										 
											2008-10-18 20:26:44 -07:00
										 |  |  | 		unlock_page(page); | 
					
						
							|  |  |  | 		putback_lru_page(page); | 
					
						
							|  |  |  | 		continue; | 
					
						
							|  |  |  | 
 | 
					
						
							| 
									
										
										
										
											2005-04-16 15:20:36 -07:00
										 |  |  | activate_locked: | 
					
						
							| 
									
										
										
										
											2008-10-18 20:26:23 -07:00
										 |  |  | 		/* Not a candidate for swapping, so reclaim swap space. */ | 
					
						
							|  |  |  | 		if (PageSwapCache(page) && vm_swap_full()) | 
					
						
							| 
									
										
										
										
											2009-01-06 14:39:36 -08:00
										 |  |  | 			try_to_free_swap(page); | 
					
						
							| 
									
										
											  
											
												Unevictable LRU Infrastructure
When the system contains lots of mlocked or otherwise unevictable pages,
the pageout code (kswapd) can spend lots of time scanning over these
pages.  Worse still, the presence of lots of unevictable pages can confuse
kswapd into thinking that more aggressive pageout modes are required,
resulting in all kinds of bad behaviour.
Infrastructure to manage pages excluded from reclaim--i.e., hidden from
vmscan.  Based on a patch by Larry Woodman of Red Hat.  Reworked to
maintain "unevictable" pages on a separate per-zone LRU list, to "hide"
them from vmscan.
Kosaki Motohiro added the support for the memory controller unevictable
lru list.
Pages on the unevictable list have both PG_unevictable and PG_lru set.
Thus, PG_unevictable is analogous to and mutually exclusive with
PG_active--it specifies which LRU list the page is on.
The unevictable infrastructure is enabled by a new mm Kconfig option
[CONFIG_]UNEVICTABLE_LRU.
A new function 'page_evictable(page, vma)' in vmscan.c tests whether or
not a page may be evictable.  Subsequent patches will add the various
!evictable tests.  We'll want to keep these tests light-weight for use in
shrink_active_list() and, possibly, the fault path.
To avoid races between tasks putting pages [back] onto an LRU list and
tasks that might be moving the page from non-evictable to evictable state,
the new function 'putback_lru_page()' -- inverse to 'isolate_lru_page()'
-- tests the "evictability" of a page after placing it on the LRU, before
dropping the reference.  If the page has become unevictable,
putback_lru_page() will redo the 'putback', thus moving the page to the
unevictable list.  This way, we avoid "stranding" evictable pages on the
unevictable list.
[akpm@linux-foundation.org: fix fallout from out-of-order merge]
[riel@redhat.com: fix UNEVICTABLE_LRU and !PROC_PAGE_MONITOR build]
[nishimura@mxp.nes.nec.co.jp: remove redundant mapping check]
[kosaki.motohiro@jp.fujitsu.com: unevictable-lru-infrastructure: putback_lru_page()/unevictable page handling rework]
[kosaki.motohiro@jp.fujitsu.com: kill unnecessary lock_page() in vmscan.c]
[kosaki.motohiro@jp.fujitsu.com: revert migration change of unevictable lru infrastructure]
[kosaki.motohiro@jp.fujitsu.com: revert to unevictable-lru-infrastructure-kconfig-fix.patch]
[kosaki.motohiro@jp.fujitsu.com: restore patch failure of vmstat-unevictable-and-mlocked-pages-vm-events.patch]
Signed-off-by: Lee Schermerhorn <lee.schermerhorn@hp.com>
Signed-off-by: Rik van Riel <riel@redhat.com>
Signed-off-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Debugged-by: Benjamin Kidwell <benjkidwell@yahoo.com>
Signed-off-by: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp>
Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
											
										 
											2008-10-18 20:26:39 -07:00
										 |  |  | 		VM_BUG_ON(PageActive(page)); | 
					
						
							| 
									
										
										
										
											2005-04-16 15:20:36 -07:00
										 |  |  | 		SetPageActive(page); | 
					
						
							|  |  |  | 		pgactivate++; | 
					
						
							|  |  |  | keep_locked: | 
					
						
							|  |  |  | 		unlock_page(page); | 
					
						
							|  |  |  | keep: | 
					
						
							|  |  |  | 		list_add(&page->lru, &ret_pages); | 
					
						
							| 
									
										
											  
											
												mlock: mlocked pages are unevictable
Make sure that mlocked pages also live on the unevictable LRU, so kswapd
will not scan them over and over again.
This is achieved through various strategies:
1) add yet another page flag--PG_mlocked--to indicate that
   the page is locked for efficient testing in vmscan and,
   optionally, fault path.  This allows early culling of
   unevictable pages, preventing them from getting to
   page_referenced()/try_to_unmap().  Also allows separate
   accounting of mlock'd pages, as Nick's original patch
   did.
   Note:  Nick's original mlock patch used a PG_mlocked
   flag.  I had removed this in favor of the PG_unevictable
   flag + an mlock_count [new page struct member].  I
   restored the PG_mlocked flag to eliminate the new
   count field.
2) add the mlock/unevictable infrastructure to mm/mlock.c,
   with internal APIs in mm/internal.h.  This is a rework
   of Nick's original patch to these files, taking into
   account that mlocked pages are now kept on unevictable
   LRU list.
3) update vmscan.c:page_evictable() to check PageMlocked()
   and, if vma passed in, the vm_flags.  Note that the vma
   will only be passed in for new pages in the fault path;
   and then only if the "cull unevictable pages in fault
   path" patch is included.
4) add try_to_unlock() to rmap.c to walk a page's rmap and
   ClearPageMlocked() if no other vmas have it mlocked.
   Reuses as much of try_to_unmap() as possible.  This
   effectively replaces the use of one of the lru list links
   as an mlock count.  If this mechanism let's pages in mlocked
   vmas leak through w/o PG_mlocked set [I don't know that it
   does], we should catch them later in try_to_unmap().  One
   hopes this will be rare, as it will be relatively expensive.
Original mm/internal.h, mm/rmap.c and mm/mlock.c changes:
Signed-off-by: Nick Piggin <npiggin@suse.de>
splitlru: introduce __get_user_pages():
  New munlock processing need to GUP_FLAGS_IGNORE_VMA_PERMISSIONS.
  because current get_user_pages() can't grab PROT_NONE pages theresore it
  cause PROT_NONE pages can't munlock.
[akpm@linux-foundation.org: fix this for pagemap-pass-mm-into-pagewalkers.patch]
[akpm@linux-foundation.org: untangle patch interdependencies]
[akpm@linux-foundation.org: fix things after out-of-order merging]
[hugh@veritas.com: fix page-flags mess]
[lee.schermerhorn@hp.com: fix munlock page table walk - now requires 'mm']
[kosaki.motohiro@jp.fujitsu.com: build fix]
[kosaki.motohiro@jp.fujitsu.com: fix truncate race and sevaral comments]
[kosaki.motohiro@jp.fujitsu.com: splitlru: introduce __get_user_pages()]
Signed-off-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Signed-off-by: Rik van Riel <riel@redhat.com>
Signed-off-by: Lee Schermerhorn <lee.schermerhorn@hp.com>
Cc: Nick Piggin <npiggin@suse.de>
Cc: Dave Hansen <dave@linux.vnet.ibm.com>
Cc: Matt Mackall <mpm@selenic.com>
Signed-off-by: Hugh Dickins <hugh@veritas.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
											
										 
											2008-10-18 20:26:44 -07:00
										 |  |  | 		VM_BUG_ON(PageLRU(page) || PageUnevictable(page)); | 
					
						
							| 
									
										
										
										
											2005-04-16 15:20:36 -07:00
										 |  |  | 	} | 
					
						
							| 
									
										
										
										
											2010-08-09 17:19:31 -07:00
										 |  |  | 
 | 
					
						
							| 
									
										
										
										
											2012-01-10 15:07:04 -08:00
										 |  |  | 	free_hot_cold_page_list(&free_pages, 1); | 
					
						
							| 
									
										
										
										
											2010-08-09 17:19:31 -07:00
										 |  |  | 
 | 
					
						
							| 
									
										
										
										
											2005-04-16 15:20:36 -07:00
										 |  |  | 	list_splice(&ret_pages, page_list); | 
					
						
							| 
									
										
										
										
											2006-06-30 01:55:45 -07:00
										 |  |  | 	count_vm_events(PGACTIVATE, pgactivate); | 
					
						
							| 
									
										
										
										
											2012-07-31 16:46:08 -07:00
										 |  |  | 	mem_cgroup_uncharge_end(); | 
					
						
							| 
									
										
										
										
											2013-07-03 15:02:02 -07:00
										 |  |  | 	*ret_nr_dirty += nr_dirty; | 
					
						
							|  |  |  | 	*ret_nr_congested += nr_congested; | 
					
						
							| 
									
										
										
										
											2013-07-03 15:01:50 -07:00
										 |  |  | 	*ret_nr_unqueued_dirty += nr_unqueued_dirty; | 
					
						
							| 
									
										
										
										
											2011-10-31 17:07:56 -07:00
										 |  |  | 	*ret_nr_writeback += nr_writeback; | 
					
						
							| 
									
										
										
										
											2013-07-03 15:01:58 -07:00
										 |  |  | 	*ret_nr_immediate += nr_immediate; | 
					
						
							| 
									
										
										
										
											2006-03-22 00:08:20 -08:00
										 |  |  | 	return nr_reclaimed; | 
					
						
							| 
									
										
										
										
											2005-04-16 15:20:36 -07:00
										 |  |  | } | 
					
						
							|  |  |  | 
 | 
					
						
							| 
									
										
										
										
											2012-10-08 16:31:55 -07:00
										 |  |  | unsigned long reclaim_clean_pages_from_list(struct zone *zone, | 
					
						
							|  |  |  | 					    struct list_head *page_list) | 
					
						
							|  |  |  | { | 
					
						
							|  |  |  | 	struct scan_control sc = { | 
					
						
							|  |  |  | 		.gfp_mask = GFP_KERNEL, | 
					
						
							|  |  |  | 		.priority = DEF_PRIORITY, | 
					
						
							|  |  |  | 		.may_unmap = 1, | 
					
						
							|  |  |  | 	}; | 
					
						
							| 
									
										
										
										
											2013-07-03 15:02:02 -07:00
										 |  |  | 	unsigned long ret, dummy1, dummy2, dummy3, dummy4, dummy5; | 
					
						
							| 
									
										
										
										
											2012-10-08 16:31:55 -07:00
										 |  |  | 	struct page *page, *next; | 
					
						
							|  |  |  | 	LIST_HEAD(clean_pages); | 
					
						
							|  |  |  | 
 | 
					
						
							|  |  |  | 	list_for_each_entry_safe(page, next, page_list, lru) { | 
					
						
							| 
									
										
										
										
											2013-09-30 13:45:16 -07:00
										 |  |  | 		if (page_is_file_cache(page) && !PageDirty(page) && | 
					
						
							|  |  |  | 		    !isolated_balloon_page(page)) { | 
					
						
							| 
									
										
										
										
											2012-10-08 16:31:55 -07:00
										 |  |  | 			ClearPageActive(page); | 
					
						
							|  |  |  | 			list_move(&page->lru, &clean_pages); | 
					
						
							|  |  |  | 		} | 
					
						
							|  |  |  | 	} | 
					
						
							|  |  |  | 
 | 
					
						
							|  |  |  | 	ret = shrink_page_list(&clean_pages, zone, &sc, | 
					
						
							| 
									
										
										
										
											2013-07-03 15:02:02 -07:00
										 |  |  | 			TTU_UNMAP|TTU_IGNORE_ACCESS, | 
					
						
							|  |  |  | 			&dummy1, &dummy2, &dummy3, &dummy4, &dummy5, true); | 
					
						
							| 
									
										
										
										
											2012-10-08 16:31:55 -07:00
										 |  |  | 	list_splice(&clean_pages, page_list); | 
					
						
							|  |  |  | 	__mod_zone_page_state(zone, NR_ISOLATED_FILE, -ret); | 
					
						
							|  |  |  | 	return ret; | 
					
						
							|  |  |  | } | 
					
						
							|  |  |  | 
 | 
					
						
							| 
									
										
										
										
											2007-07-17 04:03:16 -07:00
										 |  |  | /*
 | 
					
						
							|  |  |  |  * Attempt to remove the specified page from its LRU.  Only take this page | 
					
						
							|  |  |  |  * if it is of the appropriate PageActive status.  Pages which are being | 
					
						
							|  |  |  |  * freed elsewhere are also ignored. | 
					
						
							|  |  |  |  * | 
					
						
							|  |  |  |  * page:	page to consider | 
					
						
							|  |  |  |  * mode:	one of the LRU isolation modes defined above | 
					
						
							|  |  |  |  * | 
					
						
							|  |  |  |  * returns 0 on success, -ve errno on failure. | 
					
						
							|  |  |  |  */ | 
					
						
							| 
									
										
										
										
											2012-05-29 15:06:54 -07:00
										 |  |  | int __isolate_lru_page(struct page *page, isolate_mode_t mode) | 
					
						
							| 
									
										
										
										
											2007-07-17 04:03:16 -07:00
										 |  |  | { | 
					
						
							|  |  |  | 	int ret = -EINVAL; | 
					
						
							|  |  |  | 
 | 
					
						
							|  |  |  | 	/* Only take pages on the LRU. */ | 
					
						
							|  |  |  | 	if (!PageLRU(page)) | 
					
						
							|  |  |  | 		return ret; | 
					
						
							|  |  |  | 
 | 
					
						
							| 
									
										
										
										
											2012-10-08 16:33:48 -07:00
										 |  |  | 	/* Compaction should not handle unevictable pages but CMA can do so */ | 
					
						
							|  |  |  | 	if (PageUnevictable(page) && !(mode & ISOLATE_UNEVICTABLE)) | 
					
						
							| 
									
										
											  
											
												Unevictable LRU Infrastructure
When the system contains lots of mlocked or otherwise unevictable pages,
the pageout code (kswapd) can spend lots of time scanning over these
pages.  Worse still, the presence of lots of unevictable pages can confuse
kswapd into thinking that more aggressive pageout modes are required,
resulting in all kinds of bad behaviour.
Infrastructure to manage pages excluded from reclaim--i.e., hidden from
vmscan.  Based on a patch by Larry Woodman of Red Hat.  Reworked to
maintain "unevictable" pages on a separate per-zone LRU list, to "hide"
them from vmscan.
Kosaki Motohiro added the support for the memory controller unevictable
lru list.
Pages on the unevictable list have both PG_unevictable and PG_lru set.
Thus, PG_unevictable is analogous to and mutually exclusive with
PG_active--it specifies which LRU list the page is on.
The unevictable infrastructure is enabled by a new mm Kconfig option
[CONFIG_]UNEVICTABLE_LRU.
A new function 'page_evictable(page, vma)' in vmscan.c tests whether or
not a page may be evictable.  Subsequent patches will add the various
!evictable tests.  We'll want to keep these tests light-weight for use in
shrink_active_list() and, possibly, the fault path.
To avoid races between tasks putting pages [back] onto an LRU list and
tasks that might be moving the page from non-evictable to evictable state,
the new function 'putback_lru_page()' -- inverse to 'isolate_lru_page()'
-- tests the "evictability" of a page after placing it on the LRU, before
dropping the reference.  If the page has become unevictable,
putback_lru_page() will redo the 'putback', thus moving the page to the
unevictable list.  This way, we avoid "stranding" evictable pages on the
unevictable list.
[akpm@linux-foundation.org: fix fallout from out-of-order merge]
[riel@redhat.com: fix UNEVICTABLE_LRU and !PROC_PAGE_MONITOR build]
[nishimura@mxp.nes.nec.co.jp: remove redundant mapping check]
[kosaki.motohiro@jp.fujitsu.com: unevictable-lru-infrastructure: putback_lru_page()/unevictable page handling rework]
[kosaki.motohiro@jp.fujitsu.com: kill unnecessary lock_page() in vmscan.c]
[kosaki.motohiro@jp.fujitsu.com: revert migration change of unevictable lru infrastructure]
[kosaki.motohiro@jp.fujitsu.com: revert to unevictable-lru-infrastructure-kconfig-fix.patch]
[kosaki.motohiro@jp.fujitsu.com: restore patch failure of vmstat-unevictable-and-mlocked-pages-vm-events.patch]
Signed-off-by: Lee Schermerhorn <lee.schermerhorn@hp.com>
Signed-off-by: Rik van Riel <riel@redhat.com>
Signed-off-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Debugged-by: Benjamin Kidwell <benjkidwell@yahoo.com>
Signed-off-by: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp>
Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
											
										 
											2008-10-18 20:26:39 -07:00
										 |  |  | 		return ret; | 
					
						
							|  |  |  | 
 | 
					
						
							| 
									
										
										
										
											2007-07-17 04:03:16 -07:00
										 |  |  | 	ret = -EBUSY; | 
					
						
							| 
									
										
											  
											
												memcg: synchronized LRU
A big patch for changing memcg's LRU semantics.
Now,
  - page_cgroup is linked to mem_cgroup's its own LRU (per zone).
  - LRU of page_cgroup is not synchronous with global LRU.
  - page and page_cgroup is one-to-one and statically allocated.
  - To find page_cgroup is on what LRU, you have to check pc->mem_cgroup as
    - lru = page_cgroup_zoneinfo(pc, nid_of_pc, zid_of_pc);
  - SwapCache is handled.
And, when we handle LRU list of page_cgroup, we do following.
	pc = lookup_page_cgroup(page);
	lock_page_cgroup(pc); .....................(1)
	mz = page_cgroup_zoneinfo(pc);
	spin_lock(&mz->lru_lock);
	.....add to LRU
	spin_unlock(&mz->lru_lock);
	unlock_page_cgroup(pc);
But (1) is spin_lock and we have to be afraid of dead-lock with zone->lru_lock.
So, trylock() is used at (1), now. Without (1), we can't trust "mz" is correct.
This is a trial to remove this dirty nesting of locks.
This patch changes mz->lru_lock to be zone->lru_lock.
Then, above sequence will be written as
        spin_lock(&zone->lru_lock); # in vmscan.c or swap.c via global LRU
	mem_cgroup_add/remove/etc_lru() {
		pc = lookup_page_cgroup(page);
		mz = page_cgroup_zoneinfo(pc);
		if (PageCgroupUsed(pc)) {
			....add to LRU
		}
        spin_lock(&zone->lru_lock); # in vmscan.c or swap.c via global LRU
This is much simpler.
(*) We're safe even if we don't take lock_page_cgroup(pc). Because..
    1. When pc->mem_cgroup can be modified.
       - at charge.
       - at account_move().
    2. at charge
       the PCG_USED bit is not set before pc->mem_cgroup is fixed.
    3. at account_move()
       the page is isolated and not on LRU.
Pros.
  - easy for maintenance.
  - memcg can make use of laziness of pagevec.
  - we don't have to duplicated LRU/Active/Unevictable bit in page_cgroup.
  - LRU status of memcg will be synchronized with global LRU's one.
  - # of locks are reduced.
  - account_move() is simplified very much.
Cons.
  - may increase cost of LRU rotation.
    (no impact if memcg is not configured.)
Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Li Zefan <lizf@cn.fujitsu.com>
Cc: Balbir Singh <balbir@in.ibm.com>
Cc: Pavel Emelyanov <xemul@openvz.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
											
										 
											2009-01-07 18:08:01 -08:00
										 |  |  | 
 | 
					
						
							| 
									
										
										
										
											2012-01-12 17:19:38 -08:00
										 |  |  | 	/*
 | 
					
						
							|  |  |  | 	 * To minimise LRU disruption, the caller can indicate that it only | 
					
						
							|  |  |  | 	 * wants to isolate pages it will be able to operate on without | 
					
						
							|  |  |  | 	 * blocking - clean pages for the most part. | 
					
						
							|  |  |  | 	 * | 
					
						
							|  |  |  | 	 * ISOLATE_CLEAN means that only clean pages should be isolated. This | 
					
						
							|  |  |  | 	 * is used by reclaim when it is cannot write to backing storage | 
					
						
							|  |  |  | 	 * | 
					
						
							|  |  |  | 	 * ISOLATE_ASYNC_MIGRATE is used to indicate that it only wants to pages | 
					
						
							|  |  |  | 	 * that it is possible to migrate without blocking | 
					
						
							|  |  |  | 	 */ | 
					
						
							|  |  |  | 	if (mode & (ISOLATE_CLEAN|ISOLATE_ASYNC_MIGRATE)) { | 
					
						
							|  |  |  | 		/* All the caller can do on PageWriteback is block */ | 
					
						
							|  |  |  | 		if (PageWriteback(page)) | 
					
						
							|  |  |  | 			return ret; | 
					
						
							|  |  |  | 
 | 
					
						
							|  |  |  | 		if (PageDirty(page)) { | 
					
						
							|  |  |  | 			struct address_space *mapping; | 
					
						
							|  |  |  | 
 | 
					
						
							|  |  |  | 			/* ISOLATE_CLEAN means only clean pages */ | 
					
						
							|  |  |  | 			if (mode & ISOLATE_CLEAN) | 
					
						
							|  |  |  | 				return ret; | 
					
						
							|  |  |  | 
 | 
					
						
							|  |  |  | 			/*
 | 
					
						
							|  |  |  | 			 * Only pages without mappings or that have a | 
					
						
							|  |  |  | 			 * ->migratepage callback are possible to migrate | 
					
						
							|  |  |  | 			 * without blocking | 
					
						
							|  |  |  | 			 */ | 
					
						
							|  |  |  | 			mapping = page_mapping(page); | 
					
						
							|  |  |  | 			if (mapping && !mapping->a_ops->migratepage) | 
					
						
							|  |  |  | 				return ret; | 
					
						
							|  |  |  | 		} | 
					
						
							|  |  |  | 	} | 
					
						
							| 
									
										
										
										
											2011-10-31 17:06:51 -07:00
										 |  |  | 
 | 
					
						
							| 
									
										
										
										
											2011-10-31 17:06:55 -07:00
										 |  |  | 	if ((mode & ISOLATE_UNMAPPED) && page_mapped(page)) | 
					
						
							|  |  |  | 		return ret; | 
					
						
							|  |  |  | 
 | 
					
						
							| 
									
										
										
										
											2007-07-17 04:03:16 -07:00
										 |  |  | 	if (likely(get_page_unless_zero(page))) { | 
					
						
							|  |  |  | 		/*
 | 
					
						
							|  |  |  | 		 * Be careful not to clear PageLRU until after we're | 
					
						
							|  |  |  | 		 * sure the page is not being freed elsewhere -- the | 
					
						
							|  |  |  | 		 * page release code relies on it. | 
					
						
							|  |  |  | 		 */ | 
					
						
							|  |  |  | 		ClearPageLRU(page); | 
					
						
							|  |  |  | 		ret = 0; | 
					
						
							|  |  |  | 	} | 
					
						
							|  |  |  | 
 | 
					
						
							|  |  |  | 	return ret; | 
					
						
							|  |  |  | } | 
					
						
							|  |  |  | 
 | 
					
						
							| 
									
										
										
										
											2005-04-16 15:20:36 -07:00
										 |  |  | /*
 | 
					
						
							|  |  |  |  * zone->lru_lock is heavily contended.  Some of the functions that | 
					
						
							|  |  |  |  * shrink the lists perform better by taking out a batch of pages | 
					
						
							|  |  |  |  * and working on them outside the LRU lock. | 
					
						
							|  |  |  |  * | 
					
						
							|  |  |  |  * For pagecache intensive workloads, this function is the hottest | 
					
						
							|  |  |  |  * spot in the kernel (apart from copy_*_user functions). | 
					
						
							|  |  |  |  * | 
					
						
							|  |  |  |  * Appropriate locks must be held before calling this function. | 
					
						
							|  |  |  |  * | 
					
						
							|  |  |  |  * @nr_to_scan:	The number of pages to look through on the list. | 
					
						
							| 
									
										
										
										
											2012-05-29 15:06:58 -07:00
										 |  |  |  * @lruvec:	The LRU vector to pull pages from. | 
					
						
							| 
									
										
										
										
											2005-04-16 15:20:36 -07:00
										 |  |  |  * @dst:	The temp list to put pages on to. | 
					
						
							| 
									
										
										
										
											2012-01-12 17:20:06 -08:00
										 |  |  |  * @nr_scanned:	The number of pages that were scanned. | 
					
						
							| 
									
										
										
										
											2012-03-21 16:33:51 -07:00
										 |  |  |  * @sc:		The scan_control struct for this reclaim session | 
					
						
							| 
									
										
										
										
											2007-07-17 04:03:16 -07:00
										 |  |  |  * @mode:	One of the LRU isolation modes | 
					
						
							| 
									
										
										
										
											2012-05-29 15:06:53 -07:00
										 |  |  |  * @lru:	LRU list id for isolating | 
					
						
							| 
									
										
										
										
											2005-04-16 15:20:36 -07:00
										 |  |  |  * | 
					
						
							|  |  |  |  * returns how many pages were moved onto *@dst. | 
					
						
							|  |  |  |  */ | 
					
						
							| 
									
										
										
										
											2006-03-22 00:08:19 -08:00
										 |  |  | static unsigned long isolate_lru_pages(unsigned long nr_to_scan, | 
					
						
							| 
									
										
										
										
											2012-05-29 15:06:58 -07:00
										 |  |  | 		struct lruvec *lruvec, struct list_head *dst, | 
					
						
							| 
									
										
										
										
											2012-03-21 16:33:51 -07:00
										 |  |  | 		unsigned long *nr_scanned, struct scan_control *sc, | 
					
						
							| 
									
										
										
										
											2012-05-29 15:06:53 -07:00
										 |  |  | 		isolate_mode_t mode, enum lru_list lru) | 
					
						
							| 
									
										
										
										
											2005-04-16 15:20:36 -07:00
										 |  |  | { | 
					
						
							| 
									
										
										
										
											2012-05-29 15:07:09 -07:00
										 |  |  | 	struct list_head *src = &lruvec->lists[lru]; | 
					
						
							| 
									
										
										
										
											2006-03-22 00:08:19 -08:00
										 |  |  | 	unsigned long nr_taken = 0; | 
					
						
							| 
									
										
										
										
											2006-03-22 00:08:23 -08:00
										 |  |  | 	unsigned long scan; | 
					
						
							| 
									
										
										
										
											2005-04-16 15:20:36 -07:00
										 |  |  | 
 | 
					
						
							| 
									
										
										
										
											2006-03-22 00:08:23 -08:00
										 |  |  | 	for (scan = 0; scan < nr_to_scan && !list_empty(src); scan++) { | 
					
						
							| 
									
										
										
										
											2007-07-17 04:03:16 -07:00
										 |  |  | 		struct page *page; | 
					
						
							| 
									
										
										
										
											2012-05-29 15:07:09 -07:00
										 |  |  | 		int nr_pages; | 
					
						
							| 
									
										
										
										
											2007-07-17 04:03:16 -07:00
										 |  |  | 
 | 
					
						
							| 
									
										
										
										
											2005-04-16 15:20:36 -07:00
										 |  |  | 		page = lru_to_page(src); | 
					
						
							|  |  |  | 		prefetchw_prev_lru_page(page, src, flags); | 
					
						
							|  |  |  | 
 | 
					
						
							| 
									
										
										
										
											2006-09-25 23:30:55 -07:00
										 |  |  | 		VM_BUG_ON(!PageLRU(page)); | 
					
						
							| 
									
										
										
										
											2006-03-22 00:07:59 -08:00
										 |  |  | 
 | 
					
						
							| 
									
										
										
										
											2012-05-29 15:06:54 -07:00
										 |  |  | 		switch (__isolate_lru_page(page, mode)) { | 
					
						
							| 
									
										
										
										
											2007-07-17 04:03:16 -07:00
										 |  |  | 		case 0: | 
					
						
							| 
									
										
										
										
											2012-05-29 15:07:09 -07:00
										 |  |  | 			nr_pages = hpage_nr_pages(page); | 
					
						
							|  |  |  | 			mem_cgroup_update_lru_size(lruvec, lru, -nr_pages); | 
					
						
							| 
									
										
										
										
											2007-07-17 04:03:16 -07:00
										 |  |  | 			list_move(&page->lru, dst); | 
					
						
							| 
									
										
										
										
											2012-05-29 15:07:09 -07:00
										 |  |  | 			nr_taken += nr_pages; | 
					
						
							| 
									
										
										
										
											2007-07-17 04:03:16 -07:00
										 |  |  | 			break; | 
					
						
							|  |  |  | 
 | 
					
						
							|  |  |  | 		case -EBUSY: | 
					
						
							|  |  |  | 			/* else it is being freed elsewhere */ | 
					
						
							|  |  |  | 			list_move(&page->lru, src); | 
					
						
							|  |  |  | 			continue; | 
					
						
							| 
									
										
										
										
											2006-03-22 00:07:58 -08:00
										 |  |  | 
 | 
					
						
							| 
									
										
										
										
											2007-07-17 04:03:16 -07:00
										 |  |  | 		default: | 
					
						
							|  |  |  | 			BUG(); | 
					
						
							|  |  |  | 		} | 
					
						
							| 
									
										
										
										
											2005-04-16 15:20:36 -07:00
										 |  |  | 	} | 
					
						
							|  |  |  | 
 | 
					
						
							| 
									
										
										
										
											2012-01-12 17:20:06 -08:00
										 |  |  | 	*nr_scanned = scan; | 
					
						
							| 
									
										
										
										
											2012-05-29 15:07:09 -07:00
										 |  |  | 	trace_mm_vmscan_lru_isolate(sc->order, nr_to_scan, scan, | 
					
						
							|  |  |  | 				    nr_taken, mode, is_file_lru(lru)); | 
					
						
							| 
									
										
										
										
											2005-04-16 15:20:36 -07:00
										 |  |  | 	return nr_taken; | 
					
						
							|  |  |  | } | 
					
						
							|  |  |  | 
 | 
					
						
							| 
									
										
											  
											
												vmscan: move isolate_lru_page() to vmscan.c
On large memory systems, the VM can spend way too much time scanning
through pages that it cannot (or should not) evict from memory.  Not only
does it use up CPU time, but it also provokes lock contention and can
leave large systems under memory presure in a catatonic state.
This patch series improves VM scalability by:
1) putting filesystem backed, swap backed and unevictable pages
   onto their own LRUs, so the system only scans the pages that it
   can/should evict from memory
2) switching to two handed clock replacement for the anonymous LRUs,
   so the number of pages that need to be scanned when the system
   starts swapping is bound to a reasonable number
3) keeping unevictable pages off the LRU completely, so the
   VM does not waste CPU time scanning them. ramfs, ramdisk,
   SHM_LOCKED shared memory segments and mlock()ed VMA pages
   are keept on the unevictable list.
This patch:
isolate_lru_page logically belongs to be in vmscan.c than migrate.c.
It is tough, because we don't need that function without memory migration
so there is a valid argument to have it in migrate.c.  However a
subsequent patch needs to make use of it in the core mm, so we can happily
move it to vmscan.c.
Also, make the function a little more generic by not requiring that it
adds an isolated page to a given list.  Callers can do that.
	Note that we now have '__isolate_lru_page()', that does
	something quite different, visible outside of vmscan.c
	for use with memory controller.  Methinks we need to
	rationalize these names/purposes.	--lts
[akpm@linux-foundation.org: fix mm/memory_hotplug.c build]
Signed-off-by: Nick Piggin <npiggin@suse.de>
Signed-off-by: Rik van Riel <riel@redhat.com>
Signed-off-by: Lee Schermerhorn <Lee.Schermerhorn@hp.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
											
										 
											2008-10-18 20:26:09 -07:00
										 |  |  | /**
 | 
					
						
							|  |  |  |  * isolate_lru_page - tries to isolate a page from its LRU list | 
					
						
							|  |  |  |  * @page: page to isolate from its LRU list | 
					
						
							|  |  |  |  * | 
					
						
							|  |  |  |  * Isolates a @page from an LRU list, clears PageLRU and adjusts the | 
					
						
							|  |  |  |  * vmstat statistic corresponding to whatever LRU list the page was on. | 
					
						
							|  |  |  |  * | 
					
						
							|  |  |  |  * Returns 0 if the page was removed from an LRU list. | 
					
						
							|  |  |  |  * Returns -EBUSY if the page was not on an LRU list. | 
					
						
							|  |  |  |  * | 
					
						
							|  |  |  |  * The returned page will have PageLRU() cleared.  If it was found on | 
					
						
							| 
									
										
											  
											
												Unevictable LRU Infrastructure
When the system contains lots of mlocked or otherwise unevictable pages,
the pageout code (kswapd) can spend lots of time scanning over these
pages.  Worse still, the presence of lots of unevictable pages can confuse
kswapd into thinking that more aggressive pageout modes are required,
resulting in all kinds of bad behaviour.
Infrastructure to manage pages excluded from reclaim--i.e., hidden from
vmscan.  Based on a patch by Larry Woodman of Red Hat.  Reworked to
maintain "unevictable" pages on a separate per-zone LRU list, to "hide"
them from vmscan.
Kosaki Motohiro added the support for the memory controller unevictable
lru list.
Pages on the unevictable list have both PG_unevictable and PG_lru set.
Thus, PG_unevictable is analogous to and mutually exclusive with
PG_active--it specifies which LRU list the page is on.
The unevictable infrastructure is enabled by a new mm Kconfig option
[CONFIG_]UNEVICTABLE_LRU.
A new function 'page_evictable(page, vma)' in vmscan.c tests whether or
not a page may be evictable.  Subsequent patches will add the various
!evictable tests.  We'll want to keep these tests light-weight for use in
shrink_active_list() and, possibly, the fault path.
To avoid races between tasks putting pages [back] onto an LRU list and
tasks that might be moving the page from non-evictable to evictable state,
the new function 'putback_lru_page()' -- inverse to 'isolate_lru_page()'
-- tests the "evictability" of a page after placing it on the LRU, before
dropping the reference.  If the page has become unevictable,
putback_lru_page() will redo the 'putback', thus moving the page to the
unevictable list.  This way, we avoid "stranding" evictable pages on the
unevictable list.
[akpm@linux-foundation.org: fix fallout from out-of-order merge]
[riel@redhat.com: fix UNEVICTABLE_LRU and !PROC_PAGE_MONITOR build]
[nishimura@mxp.nes.nec.co.jp: remove redundant mapping check]
[kosaki.motohiro@jp.fujitsu.com: unevictable-lru-infrastructure: putback_lru_page()/unevictable page handling rework]
[kosaki.motohiro@jp.fujitsu.com: kill unnecessary lock_page() in vmscan.c]
[kosaki.motohiro@jp.fujitsu.com: revert migration change of unevictable lru infrastructure]
[kosaki.motohiro@jp.fujitsu.com: revert to unevictable-lru-infrastructure-kconfig-fix.patch]
[kosaki.motohiro@jp.fujitsu.com: restore patch failure of vmstat-unevictable-and-mlocked-pages-vm-events.patch]
Signed-off-by: Lee Schermerhorn <lee.schermerhorn@hp.com>
Signed-off-by: Rik van Riel <riel@redhat.com>
Signed-off-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Debugged-by: Benjamin Kidwell <benjkidwell@yahoo.com>
Signed-off-by: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp>
Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
											
										 
											2008-10-18 20:26:39 -07:00
										 |  |  |  * the active list, it will have PageActive set.  If it was found on | 
					
						
							|  |  |  |  * the unevictable list, it will have the PageUnevictable bit set. That flag | 
					
						
							|  |  |  |  * may need to be cleared by the caller before letting the page go. | 
					
						
							| 
									
										
											  
											
												vmscan: move isolate_lru_page() to vmscan.c
On large memory systems, the VM can spend way too much time scanning
through pages that it cannot (or should not) evict from memory.  Not only
does it use up CPU time, but it also provokes lock contention and can
leave large systems under memory presure in a catatonic state.
This patch series improves VM scalability by:
1) putting filesystem backed, swap backed and unevictable pages
   onto their own LRUs, so the system only scans the pages that it
   can/should evict from memory
2) switching to two handed clock replacement for the anonymous LRUs,
   so the number of pages that need to be scanned when the system
   starts swapping is bound to a reasonable number
3) keeping unevictable pages off the LRU completely, so the
   VM does not waste CPU time scanning them. ramfs, ramdisk,
   SHM_LOCKED shared memory segments and mlock()ed VMA pages
   are keept on the unevictable list.
This patch:
isolate_lru_page logically belongs to be in vmscan.c than migrate.c.
It is tough, because we don't need that function without memory migration
so there is a valid argument to have it in migrate.c.  However a
subsequent patch needs to make use of it in the core mm, so we can happily
move it to vmscan.c.
Also, make the function a little more generic by not requiring that it
adds an isolated page to a given list.  Callers can do that.
	Note that we now have '__isolate_lru_page()', that does
	something quite different, visible outside of vmscan.c
	for use with memory controller.  Methinks we need to
	rationalize these names/purposes.	--lts
[akpm@linux-foundation.org: fix mm/memory_hotplug.c build]
Signed-off-by: Nick Piggin <npiggin@suse.de>
Signed-off-by: Rik van Riel <riel@redhat.com>
Signed-off-by: Lee Schermerhorn <Lee.Schermerhorn@hp.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
											
										 
											2008-10-18 20:26:09 -07:00
										 |  |  |  * | 
					
						
							|  |  |  |  * The vmstat statistic corresponding to the list on which the page was | 
					
						
							|  |  |  |  * found will be decremented. | 
					
						
							|  |  |  |  * | 
					
						
							|  |  |  |  * Restrictions: | 
					
						
							|  |  |  |  * (1) Must be called with an elevated refcount on the page. This is a | 
					
						
							|  |  |  |  *     fundamentnal difference from isolate_lru_pages (which is called | 
					
						
							|  |  |  |  *     without a stable reference). | 
					
						
							|  |  |  |  * (2) the lru_lock must not be held. | 
					
						
							|  |  |  |  * (3) interrupts must be enabled. | 
					
						
							|  |  |  |  */ | 
					
						
							|  |  |  | int isolate_lru_page(struct page *page) | 
					
						
							|  |  |  | { | 
					
						
							|  |  |  | 	int ret = -EBUSY; | 
					
						
							|  |  |  | 
 | 
					
						
							| 
									
										
										
										
											2011-05-24 17:12:21 -07:00
										 |  |  | 	VM_BUG_ON(!page_count(page)); | 
					
						
							|  |  |  | 
 | 
					
						
							| 
									
										
											  
											
												vmscan: move isolate_lru_page() to vmscan.c
On large memory systems, the VM can spend way too much time scanning
through pages that it cannot (or should not) evict from memory.  Not only
does it use up CPU time, but it also provokes lock contention and can
leave large systems under memory presure in a catatonic state.
This patch series improves VM scalability by:
1) putting filesystem backed, swap backed and unevictable pages
   onto their own LRUs, so the system only scans the pages that it
   can/should evict from memory
2) switching to two handed clock replacement for the anonymous LRUs,
   so the number of pages that need to be scanned when the system
   starts swapping is bound to a reasonable number
3) keeping unevictable pages off the LRU completely, so the
   VM does not waste CPU time scanning them. ramfs, ramdisk,
   SHM_LOCKED shared memory segments and mlock()ed VMA pages
   are keept on the unevictable list.
This patch:
isolate_lru_page logically belongs to be in vmscan.c than migrate.c.
It is tough, because we don't need that function without memory migration
so there is a valid argument to have it in migrate.c.  However a
subsequent patch needs to make use of it in the core mm, so we can happily
move it to vmscan.c.
Also, make the function a little more generic by not requiring that it
adds an isolated page to a given list.  Callers can do that.
	Note that we now have '__isolate_lru_page()', that does
	something quite different, visible outside of vmscan.c
	for use with memory controller.  Methinks we need to
	rationalize these names/purposes.	--lts
[akpm@linux-foundation.org: fix mm/memory_hotplug.c build]
Signed-off-by: Nick Piggin <npiggin@suse.de>
Signed-off-by: Rik van Riel <riel@redhat.com>
Signed-off-by: Lee Schermerhorn <Lee.Schermerhorn@hp.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
											
										 
											2008-10-18 20:26:09 -07:00
										 |  |  | 	if (PageLRU(page)) { | 
					
						
							|  |  |  | 		struct zone *zone = page_zone(page); | 
					
						
							| 
									
										
										
										
											2012-05-29 15:07:09 -07:00
										 |  |  | 		struct lruvec *lruvec; | 
					
						
							| 
									
										
											  
											
												vmscan: move isolate_lru_page() to vmscan.c
On large memory systems, the VM can spend way too much time scanning
through pages that it cannot (or should not) evict from memory.  Not only
does it use up CPU time, but it also provokes lock contention and can
leave large systems under memory presure in a catatonic state.
This patch series improves VM scalability by:
1) putting filesystem backed, swap backed and unevictable pages
   onto their own LRUs, so the system only scans the pages that it
   can/should evict from memory
2) switching to two handed clock replacement for the anonymous LRUs,
   so the number of pages that need to be scanned when the system
   starts swapping is bound to a reasonable number
3) keeping unevictable pages off the LRU completely, so the
   VM does not waste CPU time scanning them. ramfs, ramdisk,
   SHM_LOCKED shared memory segments and mlock()ed VMA pages
   are keept on the unevictable list.
This patch:
isolate_lru_page logically belongs to be in vmscan.c than migrate.c.
It is tough, because we don't need that function without memory migration
so there is a valid argument to have it in migrate.c.  However a
subsequent patch needs to make use of it in the core mm, so we can happily
move it to vmscan.c.
Also, make the function a little more generic by not requiring that it
adds an isolated page to a given list.  Callers can do that.
	Note that we now have '__isolate_lru_page()', that does
	something quite different, visible outside of vmscan.c
	for use with memory controller.  Methinks we need to
	rationalize these names/purposes.	--lts
[akpm@linux-foundation.org: fix mm/memory_hotplug.c build]
Signed-off-by: Nick Piggin <npiggin@suse.de>
Signed-off-by: Rik van Riel <riel@redhat.com>
Signed-off-by: Lee Schermerhorn <Lee.Schermerhorn@hp.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
											
										 
											2008-10-18 20:26:09 -07:00
										 |  |  | 
 | 
					
						
							|  |  |  | 		spin_lock_irq(&zone->lru_lock); | 
					
						
							| 
									
										
										
										
											2012-05-29 15:07:09 -07:00
										 |  |  | 		lruvec = mem_cgroup_page_lruvec(page, zone); | 
					
						
							| 
									
										
										
										
											2011-05-24 17:12:21 -07:00
										 |  |  | 		if (PageLRU(page)) { | 
					
						
							| 
									
										
											  
											
												Unevictable LRU Infrastructure
When the system contains lots of mlocked or otherwise unevictable pages,
the pageout code (kswapd) can spend lots of time scanning over these
pages.  Worse still, the presence of lots of unevictable pages can confuse
kswapd into thinking that more aggressive pageout modes are required,
resulting in all kinds of bad behaviour.
Infrastructure to manage pages excluded from reclaim--i.e., hidden from
vmscan.  Based on a patch by Larry Woodman of Red Hat.  Reworked to
maintain "unevictable" pages on a separate per-zone LRU list, to "hide"
them from vmscan.
Kosaki Motohiro added the support for the memory controller unevictable
lru list.
Pages on the unevictable list have both PG_unevictable and PG_lru set.
Thus, PG_unevictable is analogous to and mutually exclusive with
PG_active--it specifies which LRU list the page is on.
The unevictable infrastructure is enabled by a new mm Kconfig option
[CONFIG_]UNEVICTABLE_LRU.
A new function 'page_evictable(page, vma)' in vmscan.c tests whether or
not a page may be evictable.  Subsequent patches will add the various
!evictable tests.  We'll want to keep these tests light-weight for use in
shrink_active_list() and, possibly, the fault path.
To avoid races between tasks putting pages [back] onto an LRU list and
tasks that might be moving the page from non-evictable to evictable state,
the new function 'putback_lru_page()' -- inverse to 'isolate_lru_page()'
-- tests the "evictability" of a page after placing it on the LRU, before
dropping the reference.  If the page has become unevictable,
putback_lru_page() will redo the 'putback', thus moving the page to the
unevictable list.  This way, we avoid "stranding" evictable pages on the
unevictable list.
[akpm@linux-foundation.org: fix fallout from out-of-order merge]
[riel@redhat.com: fix UNEVICTABLE_LRU and !PROC_PAGE_MONITOR build]
[nishimura@mxp.nes.nec.co.jp: remove redundant mapping check]
[kosaki.motohiro@jp.fujitsu.com: unevictable-lru-infrastructure: putback_lru_page()/unevictable page handling rework]
[kosaki.motohiro@jp.fujitsu.com: kill unnecessary lock_page() in vmscan.c]
[kosaki.motohiro@jp.fujitsu.com: revert migration change of unevictable lru infrastructure]
[kosaki.motohiro@jp.fujitsu.com: revert to unevictable-lru-infrastructure-kconfig-fix.patch]
[kosaki.motohiro@jp.fujitsu.com: restore patch failure of vmstat-unevictable-and-mlocked-pages-vm-events.patch]
Signed-off-by: Lee Schermerhorn <lee.schermerhorn@hp.com>
Signed-off-by: Rik van Riel <riel@redhat.com>
Signed-off-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Debugged-by: Benjamin Kidwell <benjkidwell@yahoo.com>
Signed-off-by: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp>
Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
											
										 
											2008-10-18 20:26:39 -07:00
										 |  |  | 			int lru = page_lru(page); | 
					
						
							| 
									
										
										
										
											2011-05-24 17:12:21 -07:00
										 |  |  | 			get_page(page); | 
					
						
							| 
									
										
											  
											
												vmscan: move isolate_lru_page() to vmscan.c
On large memory systems, the VM can spend way too much time scanning
through pages that it cannot (or should not) evict from memory.  Not only
does it use up CPU time, but it also provokes lock contention and can
leave large systems under memory presure in a catatonic state.
This patch series improves VM scalability by:
1) putting filesystem backed, swap backed and unevictable pages
   onto their own LRUs, so the system only scans the pages that it
   can/should evict from memory
2) switching to two handed clock replacement for the anonymous LRUs,
   so the number of pages that need to be scanned when the system
   starts swapping is bound to a reasonable number
3) keeping unevictable pages off the LRU completely, so the
   VM does not waste CPU time scanning them. ramfs, ramdisk,
   SHM_LOCKED shared memory segments and mlock()ed VMA pages
   are keept on the unevictable list.
This patch:
isolate_lru_page logically belongs to be in vmscan.c than migrate.c.
It is tough, because we don't need that function without memory migration
so there is a valid argument to have it in migrate.c.  However a
subsequent patch needs to make use of it in the core mm, so we can happily
move it to vmscan.c.
Also, make the function a little more generic by not requiring that it
adds an isolated page to a given list.  Callers can do that.
	Note that we now have '__isolate_lru_page()', that does
	something quite different, visible outside of vmscan.c
	for use with memory controller.  Methinks we need to
	rationalize these names/purposes.	--lts
[akpm@linux-foundation.org: fix mm/memory_hotplug.c build]
Signed-off-by: Nick Piggin <npiggin@suse.de>
Signed-off-by: Rik van Riel <riel@redhat.com>
Signed-off-by: Lee Schermerhorn <Lee.Schermerhorn@hp.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
											
										 
											2008-10-18 20:26:09 -07:00
										 |  |  | 			ClearPageLRU(page); | 
					
						
							| 
									
										
										
										
											2012-05-29 15:07:09 -07:00
										 |  |  | 			del_page_from_lru_list(page, lruvec, lru); | 
					
						
							|  |  |  | 			ret = 0; | 
					
						
							| 
									
										
											  
											
												vmscan: move isolate_lru_page() to vmscan.c
On large memory systems, the VM can spend way too much time scanning
through pages that it cannot (or should not) evict from memory.  Not only
does it use up CPU time, but it also provokes lock contention and can
leave large systems under memory presure in a catatonic state.
This patch series improves VM scalability by:
1) putting filesystem backed, swap backed and unevictable pages
   onto their own LRUs, so the system only scans the pages that it
   can/should evict from memory
2) switching to two handed clock replacement for the anonymous LRUs,
   so the number of pages that need to be scanned when the system
   starts swapping is bound to a reasonable number
3) keeping unevictable pages off the LRU completely, so the
   VM does not waste CPU time scanning them. ramfs, ramdisk,
   SHM_LOCKED shared memory segments and mlock()ed VMA pages
   are keept on the unevictable list.
This patch:
isolate_lru_page logically belongs to be in vmscan.c than migrate.c.
It is tough, because we don't need that function without memory migration
so there is a valid argument to have it in migrate.c.  However a
subsequent patch needs to make use of it in the core mm, so we can happily
move it to vmscan.c.
Also, make the function a little more generic by not requiring that it
adds an isolated page to a given list.  Callers can do that.
	Note that we now have '__isolate_lru_page()', that does
	something quite different, visible outside of vmscan.c
	for use with memory controller.  Methinks we need to
	rationalize these names/purposes.	--lts
[akpm@linux-foundation.org: fix mm/memory_hotplug.c build]
Signed-off-by: Nick Piggin <npiggin@suse.de>
Signed-off-by: Rik van Riel <riel@redhat.com>
Signed-off-by: Lee Schermerhorn <Lee.Schermerhorn@hp.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
											
										 
											2008-10-18 20:26:09 -07:00
										 |  |  | 		} | 
					
						
							|  |  |  | 		spin_unlock_irq(&zone->lru_lock); | 
					
						
							|  |  |  | 	} | 
					
						
							|  |  |  | 	return ret; | 
					
						
							|  |  |  | } | 
					
						
							|  |  |  | 
 | 
					
						
							| 
									
										
										
										
											2009-09-21 17:01:38 -07:00
										 |  |  | /*
 | 
					
						
							| 
									
										
										
										
											2012-12-18 14:23:28 -08:00
										 |  |  |  * A direct reclaimer may isolate SWAP_CLUSTER_MAX pages from the LRU list and | 
					
						
							|  |  |  |  * then get resheduled. When there are massive number of tasks doing page | 
					
						
							|  |  |  |  * allocation, such sleeping direct reclaimers may keep piling up on each CPU, | 
					
						
							|  |  |  |  * the LRU list will go small and be scanned faster than necessary, leading to | 
					
						
							|  |  |  |  * unnecessary swapping, thrashing and OOM. | 
					
						
							| 
									
										
										
										
											2009-09-21 17:01:38 -07:00
										 |  |  |  */ | 
					
						
							|  |  |  | static int too_many_isolated(struct zone *zone, int file, | 
					
						
							|  |  |  | 		struct scan_control *sc) | 
					
						
							|  |  |  | { | 
					
						
							|  |  |  | 	unsigned long inactive, isolated; | 
					
						
							|  |  |  | 
 | 
					
						
							|  |  |  | 	if (current_is_kswapd()) | 
					
						
							|  |  |  | 		return 0; | 
					
						
							|  |  |  | 
 | 
					
						
							| 
									
										
										
										
											2012-01-12 17:17:50 -08:00
										 |  |  | 	if (!global_reclaim(sc)) | 
					
						
							| 
									
										
										
										
											2009-09-21 17:01:38 -07:00
										 |  |  | 		return 0; | 
					
						
							|  |  |  | 
 | 
					
						
							|  |  |  | 	if (file) { | 
					
						
							|  |  |  | 		inactive = zone_page_state(zone, NR_INACTIVE_FILE); | 
					
						
							|  |  |  | 		isolated = zone_page_state(zone, NR_ISOLATED_FILE); | 
					
						
							|  |  |  | 	} else { | 
					
						
							|  |  |  | 		inactive = zone_page_state(zone, NR_INACTIVE_ANON); | 
					
						
							|  |  |  | 		isolated = zone_page_state(zone, NR_ISOLATED_ANON); | 
					
						
							|  |  |  | 	} | 
					
						
							|  |  |  | 
 | 
					
						
							| 
									
										
											  
											
												mm/vmscan.c: avoid possible deadlock caused by too_many_isolated()
Neil found that if too_many_isolated() returns true while performing
direct reclaim we can end up waiting for other threads to complete their
direct reclaim.  If those threads are allowed to enter the FS or IO to
free memory, but this thread is not, then it is possible that those
threads will be waiting on this thread and so we get a circular deadlock.
some task enters direct reclaim with GFP_KERNEL
  => too_many_isolated() false
    => vmscan and run into dirty pages
      => pageout()
        => take some FS lock
          => fs/block code does GFP_NOIO allocation
            => enter direct reclaim again
              => too_many_isolated() true
                => waiting for others to progress, however the other
                   tasks may be circular waiting for the FS lock..
The fix is to let !__GFP_IO and !__GFP_FS direct reclaims enjoy higher
priority than normal ones, by lowering the throttle threshold for the
latter.
Allowing ~1/8 isolated pages in normal is large enough.  For example, for
a 1GB LRU list, that's ~128MB isolated pages, or 1k blocked tasks (each
isolates 32 4KB pages), or 64 blocked tasks per logical CPU (assuming 16
logical CPUs per NUMA node).  So it's not likely some CPU goes idle
waiting (when it could make progress) because of this limit: there are
much more sleeping reclaim tasks than the number of CPU, so the task may
well be blocked by some low level queue/lock anyway.
Now !GFP_IOFS reclaims won't be waiting for GFP_IOFS reclaims to progress.
 They will be blocked only when there are too many concurrent !GFP_IOFS
reclaims, however that's very unlikely because the IO-less direct reclaims
is able to progress much more faster, and they won't deadlock each other.
The threshold is raised high enough for them, so that there can be
sufficient parallel progress of !GFP_IOFS reclaims.
[akpm@linux-foundation.org: tweak comment]
Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
Cc: Torsten Kaiser <just.for.lkml@googlemail.com>
Tested-by: NeilBrown <neilb@suse.de>
Reviewed-by: Minchan Kim <minchan.kim@gmail.com>
Acked-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Acked-by: Rik van Riel <riel@redhat.com>
Cc: Mel Gorman <mel@csn.ul.ie>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
											
										 
											2012-12-18 14:23:31 -08:00
										 |  |  | 	/*
 | 
					
						
							|  |  |  | 	 * GFP_NOIO/GFP_NOFS callers are allowed to isolate more pages, so they | 
					
						
							|  |  |  | 	 * won't get blocked by normal direct-reclaimers, forming a circular | 
					
						
							|  |  |  | 	 * deadlock. | 
					
						
							|  |  |  | 	 */ | 
					
						
							|  |  |  | 	if ((sc->gfp_mask & GFP_IOFS) == GFP_IOFS) | 
					
						
							|  |  |  | 		inactive >>= 3; | 
					
						
							|  |  |  | 
 | 
					
						
							| 
									
										
										
										
											2009-09-21 17:01:38 -07:00
										 |  |  | 	return isolated > inactive; | 
					
						
							|  |  |  | } | 
					
						
							|  |  |  | 
 | 
					
						
							| 
									
										
										
										
											2010-08-09 17:19:30 -07:00
										 |  |  | static noinline_for_stack void | 
					
						
							| 
									
										
										
										
											2012-05-29 15:07:09 -07:00
										 |  |  | putback_inactive_pages(struct lruvec *lruvec, struct list_head *page_list) | 
					
						
							| 
									
										
										
										
											2010-08-09 17:19:30 -07:00
										 |  |  | { | 
					
						
							| 
									
										
										
										
											2012-05-29 15:07:00 -07:00
										 |  |  | 	struct zone_reclaim_stat *reclaim_stat = &lruvec->reclaim_stat; | 
					
						
							|  |  |  | 	struct zone *zone = lruvec_zone(lruvec); | 
					
						
							| 
									
										
										
										
											2012-01-12 17:20:07 -08:00
										 |  |  | 	LIST_HEAD(pages_to_free); | 
					
						
							| 
									
										
										
										
											2010-08-09 17:19:30 -07:00
										 |  |  | 
 | 
					
						
							|  |  |  | 	/*
 | 
					
						
							|  |  |  | 	 * Put back any unfreeable pages. | 
					
						
							|  |  |  | 	 */ | 
					
						
							|  |  |  | 	while (!list_empty(page_list)) { | 
					
						
							| 
									
										
										
										
											2012-01-12 17:20:07 -08:00
										 |  |  | 		struct page *page = lru_to_page(page_list); | 
					
						
							| 
									
										
										
										
											2010-08-09 17:19:30 -07:00
										 |  |  | 		int lru; | 
					
						
							| 
									
										
										
										
											2012-01-12 17:20:07 -08:00
										 |  |  | 
 | 
					
						
							| 
									
										
										
										
											2010-08-09 17:19:30 -07:00
										 |  |  | 		VM_BUG_ON(PageLRU(page)); | 
					
						
							|  |  |  | 		list_del(&page->lru); | 
					
						
							| 
									
										
										
										
											2012-10-08 16:33:18 -07:00
										 |  |  | 		if (unlikely(!page_evictable(page))) { | 
					
						
							| 
									
										
										
										
											2010-08-09 17:19:30 -07:00
										 |  |  | 			spin_unlock_irq(&zone->lru_lock); | 
					
						
							|  |  |  | 			putback_lru_page(page); | 
					
						
							|  |  |  | 			spin_lock_irq(&zone->lru_lock); | 
					
						
							|  |  |  | 			continue; | 
					
						
							|  |  |  | 		} | 
					
						
							| 
									
										
										
										
											2012-05-29 15:07:09 -07:00
										 |  |  | 
 | 
					
						
							|  |  |  | 		lruvec = mem_cgroup_page_lruvec(page, zone); | 
					
						
							|  |  |  | 
 | 
					
						
							| 
									
										
										
										
											2011-01-17 14:42:19 -08:00
										 |  |  | 		SetPageLRU(page); | 
					
						
							| 
									
										
										
										
											2010-08-09 17:19:30 -07:00
										 |  |  | 		lru = page_lru(page); | 
					
						
							| 
									
										
										
										
											2012-05-29 15:07:09 -07:00
										 |  |  | 		add_page_to_lru_list(page, lruvec, lru); | 
					
						
							|  |  |  | 
 | 
					
						
							| 
									
										
										
										
											2010-08-09 17:19:30 -07:00
										 |  |  | 		if (is_active_lru(lru)) { | 
					
						
							|  |  |  | 			int file = is_file_lru(lru); | 
					
						
							| 
									
										
										
										
											2011-01-13 15:47:13 -08:00
										 |  |  | 			int numpages = hpage_nr_pages(page); | 
					
						
							|  |  |  | 			reclaim_stat->recent_rotated[file] += numpages; | 
					
						
							| 
									
										
										
										
											2010-08-09 17:19:30 -07:00
										 |  |  | 		} | 
					
						
							| 
									
										
										
										
											2012-01-12 17:19:56 -08:00
										 |  |  | 		if (put_page_testzero(page)) { | 
					
						
							|  |  |  | 			__ClearPageLRU(page); | 
					
						
							|  |  |  | 			__ClearPageActive(page); | 
					
						
							| 
									
										
										
										
											2012-05-29 15:07:09 -07:00
										 |  |  | 			del_page_from_lru_list(page, lruvec, lru); | 
					
						
							| 
									
										
										
										
											2012-01-12 17:19:56 -08:00
										 |  |  | 
 | 
					
						
							|  |  |  | 			if (unlikely(PageCompound(page))) { | 
					
						
							|  |  |  | 				spin_unlock_irq(&zone->lru_lock); | 
					
						
							|  |  |  | 				(*get_compound_page_dtor(page))(page); | 
					
						
							|  |  |  | 				spin_lock_irq(&zone->lru_lock); | 
					
						
							|  |  |  | 			} else | 
					
						
							|  |  |  | 				list_add(&page->lru, &pages_to_free); | 
					
						
							| 
									
										
										
										
											2010-08-09 17:19:30 -07:00
										 |  |  | 		} | 
					
						
							|  |  |  | 	} | 
					
						
							|  |  |  | 
 | 
					
						
							| 
									
										
										
										
											2012-01-12 17:20:07 -08:00
										 |  |  | 	/*
 | 
					
						
							|  |  |  | 	 * To save our caller's stack, now use input list for pages to free. | 
					
						
							|  |  |  | 	 */ | 
					
						
							|  |  |  | 	list_splice(&pages_to_free, page_list); | 
					
						
							| 
									
										
										
										
											2010-08-09 17:19:30 -07:00
										 |  |  | } | 
					
						
							|  |  |  | 
 | 
					
						
							| 
									
										
										
										
											2005-04-16 15:20:36 -07:00
										 |  |  | /*
 | 
					
						
							| 
									
										
										
											
												[PATCH] vmscan: rename functions
We have:
	try_to_free_pages
	->shrink_caches(struct zone **zones, ..)
	  ->shrink_zone(struct zone *, ...)
	    ->shrink_cache(struct zone *, ...)
	      ->shrink_list(struct list_head *, ...)
	    ->refill_inactive_list((struct zone *, ...)
which is fairly irrational.
Rename things so that we have
 	try_to_free_pages
 	->shrink_zones(struct zone **zones, ..)
 	  ->shrink_zone(struct zone *, ...)
 	    ->shrink_inactive_list(struct zone *, ...)
 	      ->shrink_page_list(struct list_head *, ...)
	    ->shrink_active_list(struct zone *, ...)
Cc: Nick Piggin <nickpiggin@yahoo.com.au>
Cc: Christoph Lameter <christoph@lameter.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
											
										 
											2006-03-22 00:08:21 -08:00
										 |  |  |  * shrink_inactive_list() is a helper for shrink_zone().  It returns the number | 
					
						
							|  |  |  |  * of reclaimed pages | 
					
						
							| 
									
										
										
										
											2005-04-16 15:20:36 -07:00
										 |  |  |  */ | 
					
						
							| 
									
										
										
										
											2010-08-09 17:19:30 -07:00
										 |  |  | static noinline_for_stack unsigned long | 
					
						
							| 
									
										
										
										
											2012-05-29 15:07:01 -07:00
										 |  |  | shrink_inactive_list(unsigned long nr_to_scan, struct lruvec *lruvec, | 
					
						
							| 
									
										
										
										
											2012-05-29 15:06:57 -07:00
										 |  |  | 		     struct scan_control *sc, enum lru_list lru) | 
					
						
							| 
									
										
										
										
											2005-04-16 15:20:36 -07:00
										 |  |  | { | 
					
						
							|  |  |  | 	LIST_HEAD(page_list); | 
					
						
							| 
									
										
										
										
											2010-08-09 17:19:28 -07:00
										 |  |  | 	unsigned long nr_scanned; | 
					
						
							| 
									
										
										
										
											2006-03-22 00:08:20 -08:00
										 |  |  | 	unsigned long nr_reclaimed = 0; | 
					
						
							| 
									
										
										
										
											2010-08-09 17:19:28 -07:00
										 |  |  | 	unsigned long nr_taken; | 
					
						
							| 
									
										
										
										
											2013-07-03 15:02:02 -07:00
										 |  |  | 	unsigned long nr_dirty = 0; | 
					
						
							|  |  |  | 	unsigned long nr_congested = 0; | 
					
						
							| 
									
										
											  
											
												mm: vmscan: stall page reclaim and writeback pages based on dirty/writepage pages encountered
Further testing of the "Reduce system disruption due to kswapd"
discovered a few problems.  First and foremost, it's possible for pages
under writeback to be freed which will lead to badness.  Second, as
pages were not being swapped the file LRU was being scanned faster and
clean file pages were being reclaimed.  In some cases this results in
increased read IO to re-read data from disk.  Third, more pages were
being written from kswapd context which can adversly affect IO
performance.  Lastly, it was observed that PageDirty pages are not
necessarily dirty on all filesystems (buffers can be clean while
PageDirty is set and ->writepage generates no IO) and not all
filesystems set PageWriteback when the page is being written (e.g.
ext3).  This disconnect confuses the reclaim stalling logic.  This
follow-up series is aimed at these problems.
The tests were based on three kernels
vanilla:	kernel 3.9 as that is what the current mmotm uses as a baseline
mmotm-20130522	is mmotm as of 22nd May with "Reduce system disruption due to
		kswapd" applied on top as per what should be in Andrew's tree
		right now
lessdisrupt-v7r10 is this follow-up series on top of the mmotm kernel
The first test used memcached+memcachetest while some background IO was
in progress as implemented by the parallel IO tests implement in MM
Tests.  memcachetest benchmarks how many operations/second memcached can
service.  It starts with no background IO on a freshly created ext4
filesystem and then re-runs the test with larger amounts of IO in the
background to roughly simulate a large copy in progress.  The
expectation is that the IO should have little or no impact on
memcachetest which is running entirely in memory.
parallelio
                                             3.9.0                       3.9.0                       3.9.0
                                           vanilla          mm1-mmotm-20130522       mm1-lessdisrupt-v7r10
Ops memcachetest-0M             23117.00 (  0.00%)          22780.00 ( -1.46%)          22763.00 ( -1.53%)
Ops memcachetest-715M           23774.00 (  0.00%)          23299.00 ( -2.00%)          22934.00 ( -3.53%)
Ops memcachetest-2385M           4208.00 (  0.00%)          24154.00 (474.00%)          23765.00 (464.76%)
Ops memcachetest-4055M           4104.00 (  0.00%)          25130.00 (512.33%)          24614.00 (499.76%)
Ops io-duration-0M                  0.00 (  0.00%)              0.00 (  0.00%)              0.00 (  0.00%)
Ops io-duration-715M               12.00 (  0.00%)              7.00 ( 41.67%)              6.00 ( 50.00%)
Ops io-duration-2385M             116.00 (  0.00%)             21.00 ( 81.90%)             21.00 ( 81.90%)
Ops io-duration-4055M             160.00 (  0.00%)             36.00 ( 77.50%)             35.00 ( 78.12%)
Ops swaptotal-0M                    0.00 (  0.00%)              0.00 (  0.00%)              0.00 (  0.00%)
Ops swaptotal-715M             140138.00 (  0.00%)             18.00 ( 99.99%)             18.00 ( 99.99%)
Ops swaptotal-2385M            385682.00 (  0.00%)              0.00 (  0.00%)              0.00 (  0.00%)
Ops swaptotal-4055M            418029.00 (  0.00%)              0.00 (  0.00%)              0.00 (  0.00%)
Ops swapin-0M                       0.00 (  0.00%)              0.00 (  0.00%)              0.00 (  0.00%)
Ops swapin-715M                   144.00 (  0.00%)              0.00 (  0.00%)              0.00 (  0.00%)
Ops swapin-2385M               134227.00 (  0.00%)              0.00 (  0.00%)              0.00 (  0.00%)
Ops swapin-4055M               125618.00 (  0.00%)              0.00 (  0.00%)              0.00 (  0.00%)
Ops minorfaults-0M            1536429.00 (  0.00%)        1531632.00 (  0.31%)        1533541.00 (  0.19%)
Ops minorfaults-715M          1786996.00 (  0.00%)        1612148.00 (  9.78%)        1608832.00 (  9.97%)
Ops minorfaults-2385M         1757952.00 (  0.00%)        1614874.00 (  8.14%)        1613541.00 (  8.21%)
Ops minorfaults-4055M         1774460.00 (  0.00%)        1633400.00 (  7.95%)        1630881.00 (  8.09%)
Ops majorfaults-0M                  1.00 (  0.00%)              0.00 (  0.00%)              0.00 (  0.00%)
Ops majorfaults-715M              184.00 (  0.00%)            167.00 (  9.24%)            166.00 (  9.78%)
Ops majorfaults-2385M           24444.00 (  0.00%)            155.00 ( 99.37%)             93.00 ( 99.62%)
Ops majorfaults-4055M           21357.00 (  0.00%)            147.00 ( 99.31%)            134.00 ( 99.37%)
memcachetest is the transactions/second reported by memcachetest. In
        the vanilla kernel note that performance drops from around
        23K/sec to just over 4K/second when there is 2385M of IO going
        on in the background. With current mmotm, there is no collapse
	in performance and with this follow-up series there is little
	change.
swaptotal is the total amount of swap traffic. With mmotm and the follow-up
	series, the total amount of swapping is much reduced.
                                 3.9.0       3.9.0       3.9.0
                               vanillamm1-mmotm-20130522mm1-lessdisrupt-v7r10
Minor Faults                  11160152    10706748    10622316
Major Faults                     46305         755         678
Swap Ins                        260249           0           0
Swap Outs                       683860          18          18
Direct pages scanned                 0         678        2520
Kswapd pages scanned           6046108     8814900     1639279
Kswapd pages reclaimed         1081954     1172267     1094635
Direct pages reclaimed               0         566        2304
Kswapd efficiency                  17%         13%         66%
Kswapd velocity               5217.560    7618.953    1414.879
Direct efficiency                 100%         83%         91%
Direct velocity                  0.000       0.586       2.175
Percentage direct scans             0%          0%          0%
Zone normal velocity          5105.086    6824.681     671.158
Zone dma32 velocity            112.473     794.858     745.896
Zone dma velocity                0.000       0.000       0.000
Page writes by reclaim     1929612.000 6861768.000   32821.000
Page writes file               1245752     6861750       32803
Page writes anon                683860          18          18
Page reclaim immediate            7484          40         239
Sector Reads                   1130320       93996       86900
Sector Writes                 13508052    10823500    11804436
Page rescued immediate               0           0           0
Slabs scanned                    33536       27136       18560
Direct inode steals                  0           0           0
Kswapd inode steals               8641        1035           0
Kswapd skipped wait                  0           0           0
THP fault alloc                      8          37          33
THP collapse alloc                 508         552         515
THP splits                          24           1           1
THP fault fallback                   0           0           0
THP collapse fail                    0           0           0
There are a number of observations to make here
1. Swap outs are almost eliminated. Swap ins are 0 indicating that the
   pages swapped were really unused anonymous pages. Related to that,
   major faults are much reduced.
2. kswapd efficiency was impacted by the initial series but with these
   follow-up patches, the efficiency is now at 66% indicating that far
   fewer pages were skipped during scanning due to dirty or writeback
   pages.
3. kswapd velocity is reduced indicating that fewer pages are being scanned
   with the follow-up series as kswapd now stalls when the tail of the
   LRU queue is full of unqueued dirty pages. The stall gives flushers a
   chance to catch-up so kswapd can reclaim clean pages when it wakes
4. In light of Zlatko's recent reports about zone scanning imbalances,
   mmtests now reports scanning velocity on a per-zone basis. With mainline,
   you can see that the scanning activity is dominated by the Normal
   zone with over 45 times more scanning in Normal than the DMA32 zone.
   With the series currently in mmotm, the ratio is slightly better but it
   is still the case that the bulk of scanning is in the highest zone. With
   this follow-up series, the ratio of scanning between the Normal and
   DMA32 zone is roughly equal.
5. As Dave Chinner observed, the current patches in mmotm increased the
   number of pages written from kswapd context which is expected to adversly
   impact IO performance. With the follow-up patches, far fewer pages are
   written from kswapd context than the mainline kernel
6. With the series in mmotm, fewer inodes were reclaimed by kswapd. With
   the follow-up series, there is less slab shrinking activity and no inodes
   were reclaimed.
7. Note that "Sectors Read" is drastically reduced implying that the source
   data being used for the IO is not being aggressively discarded due to
   page reclaim skipping over dirty pages and reclaiming clean pages. Note
   that the reducion in reads could also be due to inode data not being
   re-read from disk after a slab shrink.
                       3.9.0       3.9.0       3.9.0
                     vanillamm1-mmotm-20130522mm1-lessdisrupt-v7r10
Mean sda-avgqz        166.99       32.09       33.44
Mean sda-await        853.64      192.76      185.43
Mean sda-r_await        6.31        9.24        5.97
Mean sda-w_await     2992.81      202.65      192.43
Max  sda-avgqz       1409.91      718.75      698.98
Max  sda-await       6665.74     3538.00     3124.23
Max  sda-r_await       58.96      111.95       58.00
Max  sda-w_await    28458.94     3977.29     3148.61
In light of the changes in writes from reclaim context, the number of
reads and Dave Chinner's concerns about IO performance I took a closer
look at the IO stats for the test disk. Few observations
1. The average queue size is reduced by the initial series and roughly
   the same with this follow up.
2. Average wait times for writes are reduced and as the IO
   is completing faster it at least implies that the gain is because
   flushers are writing the files efficiently instead of page reclaim
   getting in the way.
3. The reduction in maximum write latency is staggering. 28 seconds down
   to 3 seconds.
Jan Kara asked how NFS is affected by all of this. Unstable pages can
be taken into account as one of the patches in the series shows but it
is still the case that filesystems with unusual handling of dirty or
writeback could still be treated better.
Tests like postmark, fsmark and largedd showed up nothing useful. On my test
setup, pages are simply not being written back from reclaim context with or
without the patches and there are no changes in performance. My test setup
probably is just not strong enough network-wise to be really interesting.
I ran a longer-lived memcached test with IO going to NFS instead of a local disk
parallelio
                                             3.9.0                       3.9.0                       3.9.0
                                           vanilla          mm1-mmotm-20130522       mm1-lessdisrupt-v7r10
Ops memcachetest-0M             23323.00 (  0.00%)          23241.00 ( -0.35%)          23321.00 ( -0.01%)
Ops memcachetest-715M           25526.00 (  0.00%)          24763.00 ( -2.99%)          23242.00 ( -8.95%)
Ops memcachetest-2385M           8814.00 (  0.00%)          26924.00 (205.47%)          23521.00 (166.86%)
Ops memcachetest-4055M           5835.00 (  0.00%)          26827.00 (359.76%)          25560.00 (338.05%)
Ops io-duration-0M                  0.00 (  0.00%)              0.00 (  0.00%)              0.00 (  0.00%)
Ops io-duration-715M               65.00 (  0.00%)             71.00 ( -9.23%)             11.00 ( 83.08%)
Ops io-duration-2385M             129.00 (  0.00%)             94.00 ( 27.13%)             53.00 ( 58.91%)
Ops io-duration-4055M             301.00 (  0.00%)            100.00 ( 66.78%)            108.00 ( 64.12%)
Ops swaptotal-0M                    0.00 (  0.00%)              0.00 (  0.00%)              0.00 (  0.00%)
Ops swaptotal-715M              14394.00 (  0.00%)            949.00 ( 93.41%)             63.00 ( 99.56%)
Ops swaptotal-2385M            401483.00 (  0.00%)          24437.00 ( 93.91%)          30118.00 ( 92.50%)
Ops swaptotal-4055M            554123.00 (  0.00%)          35688.00 ( 93.56%)          63082.00 ( 88.62%)
Ops swapin-0M                       0.00 (  0.00%)              0.00 (  0.00%)              0.00 (  0.00%)
Ops swapin-715M                  4522.00 (  0.00%)            560.00 ( 87.62%)             63.00 ( 98.61%)
Ops swapin-2385M               169861.00 (  0.00%)           5026.00 ( 97.04%)          13917.00 ( 91.81%)
Ops swapin-4055M               192374.00 (  0.00%)          10056.00 ( 94.77%)          25729.00 ( 86.63%)
Ops minorfaults-0M            1445969.00 (  0.00%)        1520878.00 ( -5.18%)        1454024.00 ( -0.56%)
Ops minorfaults-715M          1557288.00 (  0.00%)        1528482.00 (  1.85%)        1535776.00 (  1.38%)
Ops minorfaults-2385M         1692896.00 (  0.00%)        1570523.00 (  7.23%)        1559622.00 (  7.87%)
Ops minorfaults-4055M         1654985.00 (  0.00%)        1581456.00 (  4.44%)        1596713.00 (  3.52%)
Ops majorfaults-0M                  0.00 (  0.00%)              1.00 (-99.00%)              0.00 (  0.00%)
Ops majorfaults-715M              763.00 (  0.00%)            265.00 ( 65.27%)             75.00 ( 90.17%)
Ops majorfaults-2385M           23861.00 (  0.00%)            894.00 ( 96.25%)           2189.00 ( 90.83%)
Ops majorfaults-4055M           27210.00 (  0.00%)           1569.00 ( 94.23%)           4088.00 ( 84.98%)
1. Performance does not collapse due to IO which is good. IO is also completing
   faster. Note with mmotm, IO completes in a third of the time and faster again
   with this series applied
2. Swapping is reduced, although not eliminated. The figures for the follow-up
   look bad but it does vary a bit as the stalling is not perfect for nfs
   or filesystems like ext3 with unusual handling of dirty and writeback
   pages
3. There are swapins, particularly with larger amounts of IO indicating
   that active pages are being reclaimed. However, the number of much
   reduced.
                                 3.9.0       3.9.0       3.9.0
                               vanillamm1-mmotm-20130522mm1-lessdisrupt-v7r10
Minor Faults                  36339175    35025445    35219699
Major Faults                    310964       27108       51887
Swap Ins                       2176399      173069      333316
Swap Outs                      3344050      357228      504824
Direct pages scanned              8972       77283       43242
Kswapd pages scanned          20899983     8939566    14772851
Kswapd pages reclaimed         6193156     5172605     5231026
Direct pages reclaimed            8450       73802       39514
Kswapd efficiency                  29%         57%         35%
Kswapd velocity               3929.743    1847.499    3058.840
Direct efficiency                  94%         95%         91%
Direct velocity                  1.687      15.972       8.954
Percentage direct scans             0%          0%          0%
Zone normal velocity          3721.907     939.103    2185.142
Zone dma32 velocity            209.522     924.368     882.651
Zone dma velocity                0.000       0.000       0.000
Page writes by reclaim     4082185.000  526319.000  537114.000
Page writes file                738135      169091       32290
Page writes anon               3344050      357228      504824
Page reclaim immediate            9524         170     5595843
Sector Reads                   8909900      861192     1483680
Sector Writes                 13428980     1488744     2076800
Page rescued immediate               0           0           0
Slabs scanned                    38016       31744       28672
Direct inode steals                  0           0           0
Kswapd inode steals                424           0           0
Kswapd skipped wait                  0           0           0
THP fault alloc                     14          15         119
THP collapse alloc                1767        1569        1618
THP splits                          30          29          25
THP fault fallback                   0           0           0
THP collapse fail                    8           5           0
Compaction stalls                   17          41         100
Compaction success                   7          31          95
Compaction failures                 10          10           5
Page migrate success              7083       22157       62217
Page migrate failure                 0           0           0
Compaction pages isolated        14847       48758      135830
Compaction migrate scanned       18328       48398      138929
Compaction free scanned        2000255      355827     1720269
Compaction cost                      7          24          68
I guess the main takeaway again is the much reduced page writes
from reclaim context and reduced reads.
                       3.9.0       3.9.0       3.9.0
                     vanillamm1-mmotm-20130522mm1-lessdisrupt-v7r10
Mean sda-avgqz         23.58        0.35        0.44
Mean sda-await        133.47       15.72       15.46
Mean sda-r_await        4.72        4.69        3.95
Mean sda-w_await      507.69       28.40       33.68
Max  sda-avgqz        680.60       12.25       23.14
Max  sda-await       3958.89      221.83      286.22
Max  sda-r_await       63.86       61.23       67.29
Max  sda-w_await    11710.38      883.57     1767.28
And as before, write wait times are much reduced.
This patch:
The patch "mm: vmscan: Have kswapd writeback pages based on dirty pages
encountered, not priority" decides whether to writeback pages from reclaim
context based on the number of dirty pages encountered.  This situation is
flagged too easily and flushers are not given the chance to catch up
resulting in more pages being written from reclaim context and potentially
impacting IO performance.  The check for PageWriteback is also misplaced
as it happens within a PageDirty check which is nonsense as the dirty may
have been cleared for IO.  The accounting is updated very late and pages
that are already under writeback, were reactivated, could not unmapped or
could not be released are all missed.  Similarly, a page is considered
congested for reasons other than being congested and pages that cannot be
written out in the correct context are skipped.  Finally, it considers
stalling and writing back filesystem pages due to encountering dirty
anonymous pages at the tail of the LRU which is dumb.
This patch causes kswapd to begin writing filesystem pages from reclaim
context only if page reclaim found that all filesystem pages at the tail
of the LRU were unqueued dirty pages.  Before it starts writing filesystem
pages, it will stall to give flushers a chance to catch up.  The decision
on whether wait_iff_congested is also now determined by dirty filesystem
pages only.  Congested pages are based on whether the underlying BDI is
congested regardless of the context of the reclaiming process.
Signed-off-by: Mel Gorman <mgorman@suse.de>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Michal Hocko <mhocko@suse.cz>
Cc: Rik van Riel <riel@redhat.com>
Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Jiri Slaby <jslaby@suse.cz>
Cc: Valdis Kletnieks <Valdis.Kletnieks@vt.edu>
Cc: Zlatko Calusic <zcalusic@bitsync.net>
Cc: dormando <dormando@rydia.net>
Cc: Trond Myklebust <trond.myklebust@fys.uio.no>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
											
										 
											2013-07-03 15:01:57 -07:00
										 |  |  | 	unsigned long nr_unqueued_dirty = 0; | 
					
						
							| 
									
										
										
										
											2011-10-31 17:07:56 -07:00
										 |  |  | 	unsigned long nr_writeback = 0; | 
					
						
							| 
									
										
										
										
											2013-07-03 15:01:58 -07:00
										 |  |  | 	unsigned long nr_immediate = 0; | 
					
						
							| 
									
										
										
										
											2012-05-29 15:06:54 -07:00
										 |  |  | 	isolate_mode_t isolate_mode = 0; | 
					
						
							| 
									
										
										
										
											2012-05-29 15:06:53 -07:00
										 |  |  | 	int file = is_file_lru(lru); | 
					
						
							| 
									
										
										
										
											2012-05-29 15:07:01 -07:00
										 |  |  | 	struct zone *zone = lruvec_zone(lruvec); | 
					
						
							|  |  |  | 	struct zone_reclaim_stat *reclaim_stat = &lruvec->reclaim_stat; | 
					
						
							| 
									
										
										
										
											2009-06-16 15:31:40 -07:00
										 |  |  | 
 | 
					
						
							| 
									
										
										
										
											2009-09-21 17:01:38 -07:00
										 |  |  | 	while (unlikely(too_many_isolated(zone, file, sc))) { | 
					
						
							| 
									
										
										
										
											2009-10-26 16:49:35 -07:00
										 |  |  | 		congestion_wait(BLK_RW_ASYNC, HZ/10); | 
					
						
							| 
									
										
										
										
											2009-09-21 17:01:38 -07:00
										 |  |  | 
 | 
					
						
							|  |  |  | 		/* We are about to die and free our memory. Return now. */ | 
					
						
							|  |  |  | 		if (fatal_signal_pending(current)) | 
					
						
							|  |  |  | 			return SWAP_CLUSTER_MAX; | 
					
						
							|  |  |  | 	} | 
					
						
							|  |  |  | 
 | 
					
						
							| 
									
										
										
										
											2005-04-16 15:20:36 -07:00
										 |  |  | 	lru_add_drain(); | 
					
						
							| 
									
										
										
										
											2011-10-31 17:06:55 -07:00
										 |  |  | 
 | 
					
						
							|  |  |  | 	if (!sc->may_unmap) | 
					
						
							| 
									
										
										
										
											2012-03-21 16:33:48 -07:00
										 |  |  | 		isolate_mode |= ISOLATE_UNMAPPED; | 
					
						
							| 
									
										
										
										
											2011-10-31 17:06:55 -07:00
										 |  |  | 	if (!sc->may_writepage) | 
					
						
							| 
									
										
										
										
											2012-03-21 16:33:48 -07:00
										 |  |  | 		isolate_mode |= ISOLATE_CLEAN; | 
					
						
							| 
									
										
										
										
											2011-10-31 17:06:55 -07:00
										 |  |  | 
 | 
					
						
							| 
									
										
										
										
											2005-04-16 15:20:36 -07:00
										 |  |  | 	spin_lock_irq(&zone->lru_lock); | 
					
						
							| 
									
										
										
										
											2009-09-21 17:01:36 -07:00
										 |  |  | 
 | 
					
						
							| 
									
										
										
										
											2012-05-29 15:06:58 -07:00
										 |  |  | 	nr_taken = isolate_lru_pages(nr_to_scan, lruvec, &page_list, | 
					
						
							|  |  |  | 				     &nr_scanned, sc, isolate_mode, lru); | 
					
						
							| 
									
										
										
										
											2012-05-29 15:06:59 -07:00
										 |  |  | 
 | 
					
						
							|  |  |  | 	__mod_zone_page_state(zone, NR_LRU_BASE + lru, -nr_taken); | 
					
						
							|  |  |  | 	__mod_zone_page_state(zone, NR_ISOLATED_ANON + file, nr_taken); | 
					
						
							|  |  |  | 
 | 
					
						
							| 
									
										
										
										
											2012-01-12 17:17:50 -08:00
										 |  |  | 	if (global_reclaim(sc)) { | 
					
						
							| 
									
										
										
										
											2010-08-09 17:19:28 -07:00
										 |  |  | 		zone->pages_scanned += nr_scanned; | 
					
						
							|  |  |  | 		if (current_is_kswapd()) | 
					
						
							| 
									
										
										
										
											2012-05-29 15:07:09 -07:00
										 |  |  | 			__count_zone_vm_events(PGSCAN_KSWAPD, zone, nr_scanned); | 
					
						
							| 
									
										
										
										
											2010-08-09 17:19:28 -07:00
										 |  |  | 		else | 
					
						
							| 
									
										
										
										
											2012-05-29 15:07:09 -07:00
										 |  |  | 			__count_zone_vm_events(PGSCAN_DIRECT, zone, nr_scanned); | 
					
						
							| 
									
										
										
										
											2010-08-09 17:19:28 -07:00
										 |  |  | 	} | 
					
						
							| 
									
										
										
										
											2012-03-21 16:34:02 -07:00
										 |  |  | 	spin_unlock_irq(&zone->lru_lock); | 
					
						
							| 
									
										
										
										
											2009-09-21 17:01:36 -07:00
										 |  |  | 
 | 
					
						
							| 
									
										
										
										
											2012-03-21 16:34:02 -07:00
										 |  |  | 	if (nr_taken == 0) | 
					
						
							| 
									
										
										
										
											2010-08-09 17:19:30 -07:00
										 |  |  | 		return 0; | 
					
						
							| 
									
										
										
										
											2007-07-17 04:03:16 -07:00
										 |  |  | 
 | 
					
						
							| 
									
										
										
										
											2012-10-08 16:31:55 -07:00
										 |  |  | 	nr_reclaimed = shrink_page_list(&page_list, zone, sc, TTU_UNMAP, | 
					
						
							| 
									
										
										
										
											2013-07-03 15:02:02 -07:00
										 |  |  | 				&nr_dirty, &nr_unqueued_dirty, &nr_congested, | 
					
						
							|  |  |  | 				&nr_writeback, &nr_immediate, | 
					
						
							|  |  |  | 				false); | 
					
						
							| 
									
										
										
										
											2007-08-22 14:01:26 -07:00
										 |  |  | 
 | 
					
						
							| 
									
										
										
										
											2012-01-12 17:20:07 -08:00
										 |  |  | 	spin_lock_irq(&zone->lru_lock); | 
					
						
							|  |  |  | 
 | 
					
						
							| 
									
										
										
										
											2012-05-29 15:06:59 -07:00
										 |  |  | 	reclaim_stat->recent_scanned[file] += nr_taken; | 
					
						
							| 
									
										
										
										
											2012-03-21 16:34:02 -07:00
										 |  |  | 
 | 
					
						
							| 
									
										
										
										
											2012-04-25 16:01:48 -07:00
										 |  |  | 	if (global_reclaim(sc)) { | 
					
						
							|  |  |  | 		if (current_is_kswapd()) | 
					
						
							|  |  |  | 			__count_zone_vm_events(PGSTEAL_KSWAPD, zone, | 
					
						
							|  |  |  | 					       nr_reclaimed); | 
					
						
							|  |  |  | 		else | 
					
						
							|  |  |  | 			__count_zone_vm_events(PGSTEAL_DIRECT, zone, | 
					
						
							|  |  |  | 					       nr_reclaimed); | 
					
						
							|  |  |  | 	} | 
					
						
							| 
									
										
										
										
											2006-01-06 00:11:20 -08:00
										 |  |  | 
 | 
					
						
							| 
									
										
										
										
											2012-05-29 15:07:00 -07:00
										 |  |  | 	putback_inactive_pages(lruvec, &page_list); | 
					
						
							| 
									
										
										
										
											2012-01-12 17:20:07 -08:00
										 |  |  | 
 | 
					
						
							| 
									
										
										
										
											2012-05-29 15:06:59 -07:00
										 |  |  | 	__mod_zone_page_state(zone, NR_ISOLATED_ANON + file, -nr_taken); | 
					
						
							| 
									
										
										
										
											2012-01-12 17:20:07 -08:00
										 |  |  | 
 | 
					
						
							|  |  |  | 	spin_unlock_irq(&zone->lru_lock); | 
					
						
							|  |  |  | 
 | 
					
						
							|  |  |  | 	free_hot_cold_page_list(&page_list, 1); | 
					
						
							| 
									
										
											  
											
												tracing, vmscan: add trace events for LRU list shrinking
There have been numerous reports of stalls that pointed at the problem
being somewhere in the VM.  There are multiple roots to the problems which
means dealing with any of the root problems in isolation is tricky to
justify on their own and they would still need integration testing.  This
patch series puts together two different patch sets which in combination
should tackle some of the root causes of latency problems being reported.
Patch 1 adds a tracepoint for shrink_inactive_list.  For this series, the
most important results is being able to calculate the scanning/reclaim
ratio as a measure of the amount of work being done by page reclaim.
Patch 2 accounts for time spent in congestion_wait.
Patches 3-6 were originally developed by Kosaki Motohiro but reworked for
this series.  It has been noted that lumpy reclaim is far too aggressive
and trashes the system somewhat.  As SLUB uses high-order allocations, a
large cost incurred by lumpy reclaim will be noticeable.  It was also
reported during transparent hugepage support testing that lumpy reclaim
was trashing the system and these patches should mitigate that problem
without disabling lumpy reclaim.
Patch 7 adds wait_iff_congested() and replaces some callers of
congestion_wait().  wait_iff_congested() only sleeps if there is a BDI
that is currently congested.  Patch 8 notes that any BDI being congested
is not necessarily a problem because there could be multiple BDIs of
varying speeds and numberous zones.  It attempts to track when a zone
being reclaimed contains many pages backed by a congested BDI and if so,
reclaimers wait on the congestion queue.
I ran a number of tests with monitoring on X86, X86-64 and PPC64. Each
machine had 3G of RAM and the CPUs were
X86:    Intel P4 2-core
X86-64: AMD Phenom 4-core
PPC64:  PPC970MP
Each used a single disk and the onboard IO controller.  Dirty ratio was
left at 20.  I'm just going to report for X86-64 and PPC64 in a vague
attempt to keep this report short.  Four kernels were tested each based on
v2.6.36-rc4
traceonly-v2r2:     Patches 1 and 2 to instrument vmscan reclaims and congestion_wait
lowlumpy-v2r3:      Patches 1-6 to test if lumpy reclaim is better
waitcongest-v2r3:   Patches 1-7 to only wait on congestion
waitwriteback-v2r4: Patches 1-8 to detect when a zone is congested
nocongest-v1r5: Patches 1-3 for testing wait_iff_congestion
nodirect-v1r5:  Patches 1-10 to disable filesystem writeback for better IO
The tests run were as follows
kernbench
	compile-based benchmark. Smoke test performance
sysbench
	OLTP read-only benchmark. Will be re-run in the future as read-write
micro-mapped-file-stream
	This is a micro-benchmark from Johannes Weiner that accesses a
	large sparse-file through mmap(). It was configured to run in only
	single-CPU mode but can be indicative of how well page reclaim
	identifies suitable pages.
stress-highalloc
	Tries to allocate huge pages under heavy load.
kernbench, iozone and sysbench did not report any performance regression
on any machine.  sysbench did pressure the system lightly and there was
reclaim activity but there were no difference of major interest between
the kernels.
X86-64 micro-mapped-file-stream
                                      traceonly-v2r2           lowlumpy-v2r3        waitcongest-v2r3     waitwriteback-v2r4
pgalloc_dma                       1639.00 (   0.00%)       667.00 (-145.73%)      1167.00 ( -40.45%)       578.00 (-183.56%)
pgalloc_dma32                  2842410.00 (   0.00%)   2842626.00 (   0.01%)   2843043.00 (   0.02%)   2843014.00 (   0.02%)
pgalloc_normal                       0.00 (   0.00%)         0.00 (   0.00%)         0.00 (   0.00%)         0.00 (   0.00%)
pgsteal_dma                        729.00 (   0.00%)        85.00 (-757.65%)       609.00 ( -19.70%)       125.00 (-483.20%)
pgsteal_dma32                  2338721.00 (   0.00%)   2447354.00 (   4.44%)   2429536.00 (   3.74%)   2436772.00 (   4.02%)
pgsteal_normal                       0.00 (   0.00%)         0.00 (   0.00%)         0.00 (   0.00%)         0.00 (   0.00%)
pgscan_kswapd_dma                 1469.00 (   0.00%)       532.00 (-176.13%)      1078.00 ( -36.27%)       220.00 (-567.73%)
pgscan_kswapd_dma32            4597713.00 (   0.00%)   4503597.00 (  -2.09%)   4295673.00 (  -7.03%)   3891686.00 ( -18.14%)
pgscan_kswapd_normal                 0.00 (   0.00%)         0.00 (   0.00%)         0.00 (   0.00%)         0.00 (   0.00%)
pgscan_direct_dma                   71.00 (   0.00%)       134.00 (  47.01%)       243.00 (  70.78%)       352.00 (  79.83%)
pgscan_direct_dma32             305820.00 (   0.00%)    280204.00 (  -9.14%)    600518.00 (  49.07%)    957485.00 (  68.06%)
pgscan_direct_normal                 0.00 (   0.00%)         0.00 (   0.00%)         0.00 (   0.00%)         0.00 (   0.00%)
pageoutrun                       16296.00 (   0.00%)     21254.00 (  23.33%)     18447.00 (  11.66%)     20067.00 (  18.79%)
allocstall                         443.00 (   0.00%)       273.00 ( -62.27%)       513.00 (  13.65%)      1568.00 (  71.75%)
These are based on the raw figures taken from /proc/vmstat.  It's a rough
measure of reclaim activity.  Note that allocstall counts are higher
because we are entering direct reclaim more often as a result of not
sleeping in congestion.  In itself, it's not necessarily a bad thing.
It's easier to get a view of what happened from the vmscan tracepoint
report.
FTrace Reclaim Statistics: vmscan
                                traceonly-v2r2   lowlumpy-v2r3 waitcongest-v2r3 waitwriteback-v2r4
Direct reclaims                                443        273        513       1568
Direct reclaim pages scanned                305968     280402     600825     957933
Direct reclaim pages reclaimed               43503      19005      30327     117191
Direct reclaim write file async I/O              0          0          0          0
Direct reclaim write anon async I/O              0          3          4         12
Direct reclaim write file sync I/O               0          0          0          0
Direct reclaim write anon sync I/O               0          0          0          0
Wake kswapd requests                        187649     132338     191695     267701
Kswapd wakeups                                   3          1          4          1
Kswapd pages scanned                       4599269    4454162    4296815    3891906
Kswapd pages reclaimed                     2295947    2428434    2399818    2319706
Kswapd reclaim write file async I/O              1          0          1          1
Kswapd reclaim write anon async I/O             59        187         41        222
Kswapd reclaim write file sync I/O               0          0          0          0
Kswapd reclaim write anon sync I/O               0          0          0          0
Time stalled direct reclaim (seconds)         4.34       2.52       6.63       2.96
Time kswapd awake (seconds)                  11.15      10.25      11.01      10.19
Total pages scanned                        4905237   4734564   4897640   4849839
Total pages reclaimed                      2339450   2447439   2430145   2436897
%age total pages scanned/reclaimed          47.69%    51.69%    49.62%    50.25%
%age total pages scanned/written             0.00%     0.00%     0.00%     0.00%
%age  file pages scanned/written             0.00%     0.00%     0.00%     0.00%
Percentage Time Spent Direct Reclaim        29.23%    19.02%    38.48%    20.25%
Percentage Time kswapd Awake                78.58%    78.85%    76.83%    79.86%
What is interesting here for nocongest in particular is that while direct
reclaim scans more pages, the overall number of pages scanned remains the
same and the ratio of pages scanned to pages reclaimed is more or less the
same.  In other words, while we are sleeping less, reclaim is not doing
more work and as direct reclaim and kswapd is awake for less time, it
would appear to be doing less work.
FTrace Reclaim Statistics: congestion_wait
Direct number congest     waited                87        196         64          0
Direct time   congest     waited            4604ms     4732ms     5420ms        0ms
Direct full   congest     waited                72        145         53          0
Direct number conditional waited                 0          0        324       1315
Direct time   conditional waited               0ms        0ms        0ms        0ms
Direct full   conditional waited                 0          0          0          0
KSwapd number congest     waited                20         10         15          7
KSwapd time   congest     waited            1264ms      536ms      884ms      284ms
KSwapd full   congest     waited                10          4          6          2
KSwapd number conditional waited                 0          0          0          0
KSwapd time   conditional waited               0ms        0ms        0ms        0ms
KSwapd full   conditional waited                 0          0          0          0
The vanilla kernel spent 8 seconds asleep in direct reclaim and no time at
all asleep with the patches.
MMTests Statistics: duration
User/Sys Time Running Test (seconds)         10.51     10.73      10.6     11.66
Total Elapsed Time (seconds)                 14.19     13.00     14.33     12.76
Overall, the tests completed faster. It is interesting to note that backing off further
when a zone is congested and not just a BDI was more efficient overall.
PPC64 micro-mapped-file-stream
pgalloc_dma                    3024660.00 (   0.00%)   3027185.00 (   0.08%)   3025845.00 (   0.04%)   3026281.00 (   0.05%)
pgalloc_normal                       0.00 (   0.00%)         0.00 (   0.00%)         0.00 (   0.00%)         0.00 (   0.00%)
pgsteal_dma                    2508073.00 (   0.00%)   2565351.00 (   2.23%)   2463577.00 (  -1.81%)   2532263.00 (   0.96%)
pgsteal_normal                       0.00 (   0.00%)         0.00 (   0.00%)         0.00 (   0.00%)         0.00 (   0.00%)
pgscan_kswapd_dma              4601307.00 (   0.00%)   4128076.00 ( -11.46%)   3912317.00 ( -17.61%)   3377165.00 ( -36.25%)
pgscan_kswapd_normal                 0.00 (   0.00%)         0.00 (   0.00%)         0.00 (   0.00%)         0.00 (   0.00%)
pgscan_direct_dma               629825.00 (   0.00%)    971622.00 (  35.18%)   1063938.00 (  40.80%)   1711935.00 (  63.21%)
pgscan_direct_normal                 0.00 (   0.00%)         0.00 (   0.00%)         0.00 (   0.00%)         0.00 (   0.00%)
pageoutrun                       27776.00 (   0.00%)     20458.00 ( -35.77%)     18763.00 ( -48.04%)     18157.00 ( -52.98%)
allocstall                         977.00 (   0.00%)      2751.00 (  64.49%)      2098.00 (  53.43%)      5136.00 (  80.98%)
Similar trends to x86-64. allocstalls are up but it's not necessarily bad.
FTrace Reclaim Statistics: vmscan
Direct reclaims                                977       2709       2098       5136
Direct reclaim pages scanned                629825     963814    1063938    1711935
Direct reclaim pages reclaimed               75550     242538     150904     387647
Direct reclaim write file async I/O              0          0          0          2
Direct reclaim write anon async I/O              0         10          0          4
Direct reclaim write file sync I/O               0          0          0          0
Direct reclaim write anon sync I/O               0          0          0          0
Wake kswapd requests                        392119    1201712     571935     571921
Kswapd wakeups                                   3          2          3          3
Kswapd pages scanned                       4601307    4128076    3912317    3377165
Kswapd pages reclaimed                     2432523    2318797    2312673    2144616
Kswapd reclaim write file async I/O             20          1          1          1
Kswapd reclaim write anon async I/O             57        132         11        121
Kswapd reclaim write file sync I/O               0          0          0          0
Kswapd reclaim write anon sync I/O               0          0          0          0
Time stalled direct reclaim (seconds)         6.19       7.30      13.04      10.88
Time kswapd awake (seconds)                  21.73      26.51      25.55      23.90
Total pages scanned                        5231132   5091890   4976255   5089100
Total pages reclaimed                      2508073   2561335   2463577   2532263
%age total pages scanned/reclaimed          47.95%    50.30%    49.51%    49.76%
%age total pages scanned/written             0.00%     0.00%     0.00%     0.00%
%age  file pages scanned/written             0.00%     0.00%     0.00%     0.00%
Percentage Time Spent Direct Reclaim        18.89%    20.65%    32.65%    27.65%
Percentage Time kswapd Awake                72.39%    80.68%    78.21%    77.40%
Again, a similar trend that the congestion_wait changes mean that direct
reclaim scans more pages but the overall number of pages scanned while
slightly reduced, are very similar.  The ratio of scanning/reclaimed
remains roughly similar.  The downside is that kswapd and direct reclaim
was awake longer and for a larger percentage of the overall workload.
It's possible there were big differences in the amount of time spent
reclaiming slab pages between the different kernels which is plausible
considering that the micro tests runs after fsmark and sysbench.
Trace Reclaim Statistics: congestion_wait
Direct number congest     waited               845       1312        104          0
Direct time   congest     waited           19416ms    26560ms     7544ms        0ms
Direct full   congest     waited               745       1105         72          0
Direct number conditional waited                 0          0       1322       2935
Direct time   conditional waited               0ms        0ms       12ms      312ms
Direct full   conditional waited                 0          0          0          3
KSwapd number congest     waited                39        102         75         63
KSwapd time   congest     waited            2484ms     6760ms     5756ms     3716ms
KSwapd full   congest     waited                20         48         46         25
KSwapd number conditional waited                 0          0          0          0
KSwapd time   conditional waited               0ms        0ms        0ms        0ms
KSwapd full   conditional waited                 0          0          0          0
The vanilla kernel spent 20 seconds asleep in direct reclaim and only
312ms asleep with the patches.  The time kswapd spent congest waited was
also reduced by a large factor.
MMTests Statistics: duration
ser/Sys Time Running Test (seconds)         26.58     28.05      26.9     28.47
Total Elapsed Time (seconds)                 30.02     32.86     32.67     30.88
With all patches applies, the completion times are very similar.
X86-64 STRESS-HIGHALLOC
                traceonly-v2r2     lowlumpy-v2r3  waitcongest-v2r3waitwriteback-v2r4
Pass 1          82.00 ( 0.00%)    84.00 ( 2.00%)    85.00 ( 3.00%)    85.00 ( 3.00%)
Pass 2          90.00 ( 0.00%)    87.00 (-3.00%)    88.00 (-2.00%)    89.00 (-1.00%)
At Rest         92.00 ( 0.00%)    90.00 (-2.00%)    90.00 (-2.00%)    91.00 (-1.00%)
Success figures across the board are broadly similar.
                traceonly-v2r2     lowlumpy-v2r3  waitcongest-v2r3waitwriteback-v2r4
Direct reclaims                               1045        944        886        887
Direct reclaim pages scanned                135091     119604     109382     101019
Direct reclaim pages reclaimed               88599      47535      47863      46671
Direct reclaim write file async I/O            494        283        465        280
Direct reclaim write anon async I/O          29357      13710      16656      13462
Direct reclaim write file sync I/O             154          2          2          3
Direct reclaim write anon sync I/O           14594        571        509        561
Wake kswapd requests                          7491        933        872        892
Kswapd wakeups                                 814        778        731        780
Kswapd pages scanned                       7290822   15341158   11916436   13703442
Kswapd pages reclaimed                     3587336    3142496    3094392    3187151
Kswapd reclaim write file async I/O          91975      32317      28022      29628
Kswapd reclaim write anon async I/O        1992022     789307     829745     849769
Kswapd reclaim write file sync I/O               0          0          0          0
Kswapd reclaim write anon sync I/O               0          0          0          0
Time stalled direct reclaim (seconds)      4588.93    2467.16    2495.41    2547.07
Time kswapd awake (seconds)                2497.66    1020.16    1098.06    1176.82
Total pages scanned                        7425913  15460762  12025818  13804461
Total pages reclaimed                      3675935   3190031   3142255   3233822
%age total pages scanned/reclaimed          49.50%    20.63%    26.13%    23.43%
%age total pages scanned/written            28.66%     5.41%     7.28%     6.47%
%age  file pages scanned/written             1.25%     0.21%     0.24%     0.22%
Percentage Time Spent Direct Reclaim        57.33%    42.15%    42.41%    42.99%
Percentage Time kswapd Awake                43.56%    27.87%    29.76%    31.25%
Scanned/reclaimed ratios again look good with big improvements in
efficiency.  The Scanned/written ratios also look much improved.  With a
better scanned/written ration, there is an expectation that IO would be
more efficient and indeed, the time spent in direct reclaim is much
reduced by the full series and kswapd spends a little less time awake.
Overall, indications here are that allocations were happening much faster
and this can be seen with a graph of the latency figures as the
allocations were taking place
http://www.csn.ul.ie/~mel/postings/vmscanreduce-20101509/highalloc-interlatency-hydra-mean.ps
FTrace Reclaim Statistics: congestion_wait
Direct number congest     waited              1333        204        169          4
Direct time   congest     waited           78896ms     8288ms     7260ms      200ms
Direct full   congest     waited               756         92         69          2
Direct number conditional waited                 0          0         26        186
Direct time   conditional waited               0ms        0ms        0ms     2504ms
Direct full   conditional waited                 0          0          0         25
KSwapd number congest     waited                 4        395        227        282
KSwapd time   congest     waited             384ms    25136ms    10508ms    18380ms
KSwapd full   congest     waited                 3        232         98        176
KSwapd number conditional waited                 0          0          0          0
KSwapd time   conditional waited               0ms        0ms        0ms        0ms
KSwapd full   conditional waited                 0          0          0          0
KSwapd full   conditional waited               318          0        312          9
Overall, the time spent speeping is reduced.  kswapd is still hitting
congestion_wait() but that is because there are callers remaining where it
wasn't clear in advance if they should be changed to wait_iff_congested()
or not.  Overall the sleep imes are reduced though - from 79ish seconds to
about 19.
MMTests Statistics: duration
User/Sys Time Running Test (seconds)       3415.43   3386.65   3388.39    3377.5
Total Elapsed Time (seconds)               5733.48   3660.33   3689.41   3765.39
With the full series, the time to complete the tests are reduced by 30%
PPC64 STRESS-HIGHALLOC
                traceonly-v2r2     lowlumpy-v2r3  waitcongest-v2r3waitwriteback-v2r4
Pass 1          17.00 ( 0.00%)    34.00 (17.00%)    38.00 (21.00%)    43.00 (26.00%)
Pass 2          25.00 ( 0.00%)    37.00 (12.00%)    42.00 (17.00%)    46.00 (21.00%)
At Rest         49.00 ( 0.00%)    43.00 (-6.00%)    45.00 (-4.00%)    51.00 ( 2.00%)
Success rates there are *way* up particularly considering that the 16MB
huge pages on PPC64 mean that it's always much harder to allocate them.
FTrace Reclaim Statistics: vmscan
              stress-highalloc  stress-highalloc  stress-highalloc  stress-highalloc
                traceonly-v2r2     lowlumpy-v2r3  waitcongest-v2r3waitwriteback-v2r4
Direct reclaims                                499        505        564        509
Direct reclaim pages scanned                223478      41898      51818      45605
Direct reclaim pages reclaimed              137730      21148      27161      23455
Direct reclaim write file async I/O            399        136        162        136
Direct reclaim write anon async I/O          46977       2865       4686       3998
Direct reclaim write file sync I/O              29          0          1          3
Direct reclaim write anon sync I/O           31023        159        237        239
Wake kswapd requests                           420        351        360        326
Kswapd wakeups                                 185        294        249        277
Kswapd pages scanned                      15703488   16392500   17821724   17598737
Kswapd pages reclaimed                     5808466    2908858    3139386    3145435
Kswapd reclaim write file async I/O         159938      18400      18717      13473
Kswapd reclaim write anon async I/O        3467554     228957     322799     234278
Kswapd reclaim write file sync I/O               0          0          0          0
Kswapd reclaim write anon sync I/O               0          0          0          0
Time stalled direct reclaim (seconds)      9665.35    1707.81    2374.32    1871.23
Time kswapd awake (seconds)                9401.21    1367.86    1951.75    1328.88
Total pages scanned                       15926966  16434398  17873542  17644342
Total pages reclaimed                      5946196   2930006   3166547   3168890
%age total pages scanned/reclaimed          37.33%    17.83%    17.72%    17.96%
%age total pages scanned/written            23.27%     1.52%     1.94%     1.43%
%age  file pages scanned/written             1.01%     0.11%     0.11%     0.08%
Percentage Time Spent Direct Reclaim        44.55%    35.10%    41.42%    36.91%
Percentage Time kswapd Awake                86.71%    43.58%    52.67%    41.14%
While the scanning rates are slightly up, the scanned/reclaimed and
scanned/written figures are much improved.  The time spent in direct
reclaim and with kswapd are massively reduced, mostly by the lowlumpy
patches.
FTrace Reclaim Statistics: congestion_wait
Direct number congest     waited               725        303        126          3
Direct time   congest     waited           45524ms     9180ms     5936ms      300ms
Direct full   congest     waited               487        190         52          3
Direct number conditional waited                 0          0        200        301
Direct time   conditional waited               0ms        0ms        0ms     1904ms
Direct full   conditional waited                 0          0          0         19
KSwapd number congest     waited                 0          2         23          4
KSwapd time   congest     waited               0ms      200ms      420ms      404ms
KSwapd full   congest     waited                 0          2          2          4
KSwapd number conditional waited                 0          0          0          0
KSwapd time   conditional waited               0ms        0ms        0ms        0ms
KSwapd full   conditional waited                 0          0          0          0
Not as dramatic a story here but the time spent asleep is reduced and we
can still see what wait_iff_congested is going to sleep when necessary.
MMTests Statistics: duration
User/Sys Time Running Test (seconds)      12028.09   3157.17   3357.79   3199.16
Total Elapsed Time (seconds)              10842.07   3138.72   3705.54   3229.85
The time to complete this test goes way down.  With the full series, we
are allocating over twice the number of huge pages in 30% of the time and
there is a corresponding impact on the allocation latency graph available
at.
http://www.csn.ul.ie/~mel/postings/vmscanreduce-20101509/highalloc-interlatency-powyah-mean.ps
This patch:
Add a trace event for shrink_inactive_list() and updates the sample
postprocessing script appropriately.  It can be used to determine how many
pages were reclaimed and for non-lumpy reclaim where exactly the pages
were reclaimed from.
Signed-off-by: Mel Gorman <mel@csn.ul.ie>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Minchan Kim <minchan.kim@gmail.com>
Cc: Wu Fengguang <fengguang.wu@intel.com>
Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Cc: Rik van Riel <riel@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
											
										 
											2010-10-26 14:21:40 -07:00
										 |  |  | 
 | 
					
						
							| 
									
										
										
										
											2011-10-31 17:07:56 -07:00
										 |  |  | 	/*
 | 
					
						
							|  |  |  | 	 * If reclaim is isolating dirty pages under writeback, it implies | 
					
						
							|  |  |  | 	 * that the long-lived page allocation rate is exceeding the page | 
					
						
							|  |  |  | 	 * laundering rate. Either the global limits are not being effective | 
					
						
							|  |  |  | 	 * at throttling processes due to the page distribution throughout | 
					
						
							|  |  |  | 	 * zones or there is heavy usage of a slow backing device. The | 
					
						
							|  |  |  | 	 * only option is to throttle from reclaim context which is not ideal | 
					
						
							|  |  |  | 	 * as there is no guarantee the dirtying process is throttled in the | 
					
						
							|  |  |  | 	 * same way balance_dirty_pages() manages. | 
					
						
							|  |  |  | 	 * | 
					
						
							| 
									
										
										
										
											2013-07-03 15:02:02 -07:00
										 |  |  | 	 * Once a zone is flagged ZONE_WRITEBACK, kswapd will count the number | 
					
						
							|  |  |  | 	 * of pages under pages flagged for immediate reclaim and stall if any | 
					
						
							|  |  |  | 	 * are encountered in the nr_immediate check below. | 
					
						
							| 
									
										
										
										
											2011-10-31 17:07:56 -07:00
										 |  |  | 	 */ | 
					
						
							| 
									
										
										
										
											2013-07-08 16:00:25 -07:00
										 |  |  | 	if (nr_writeback && nr_writeback == nr_taken) | 
					
						
							| 
									
										
										
										
											2013-07-03 15:01:51 -07:00
										 |  |  | 		zone_set_flag(zone, ZONE_WRITEBACK); | 
					
						
							| 
									
										
										
										
											2011-10-31 17:07:56 -07:00
										 |  |  | 
 | 
					
						
							| 
									
										
										
										
											2013-07-03 15:01:50 -07:00
										 |  |  | 	/*
 | 
					
						
							| 
									
										
										
										
											2013-07-03 15:01:58 -07:00
										 |  |  | 	 * memcg will stall in page writeback so only consider forcibly | 
					
						
							|  |  |  | 	 * stalling for global reclaim | 
					
						
							| 
									
										
										
										
											2013-07-03 15:01:50 -07:00
										 |  |  | 	 */ | 
					
						
							| 
									
										
										
										
											2013-07-03 15:01:58 -07:00
										 |  |  | 	if (global_reclaim(sc)) { | 
					
						
							| 
									
										
										
										
											2013-07-03 15:02:02 -07:00
										 |  |  | 		/*
 | 
					
						
							|  |  |  | 		 * Tag a zone as congested if all the dirty pages scanned were | 
					
						
							|  |  |  | 		 * backed by a congested BDI and wait_iff_congested will stall. | 
					
						
							|  |  |  | 		 */ | 
					
						
							|  |  |  | 		if (nr_dirty && nr_dirty == nr_congested) | 
					
						
							|  |  |  | 			zone_set_flag(zone, ZONE_CONGESTED); | 
					
						
							|  |  |  | 
 | 
					
						
							| 
									
										
										
										
											2013-07-03 15:01:58 -07:00
										 |  |  | 		/*
 | 
					
						
							|  |  |  | 		 * If dirty pages are scanned that are not queued for IO, it | 
					
						
							|  |  |  | 		 * implies that flushers are not keeping up. In this case, flag | 
					
						
							|  |  |  | 		 * the zone ZONE_TAIL_LRU_DIRTY and kswapd will start writing | 
					
						
							|  |  |  | 		 * pages from reclaim context. It will forcibly stall in the | 
					
						
							|  |  |  | 		 * next check. | 
					
						
							|  |  |  | 		 */ | 
					
						
							|  |  |  | 		if (nr_unqueued_dirty == nr_taken) | 
					
						
							|  |  |  | 			zone_set_flag(zone, ZONE_TAIL_LRU_DIRTY); | 
					
						
							|  |  |  | 
 | 
					
						
							|  |  |  | 		/*
 | 
					
						
							|  |  |  | 		 * In addition, if kswapd scans pages marked marked for | 
					
						
							|  |  |  | 		 * immediate reclaim and under writeback (nr_immediate), it | 
					
						
							|  |  |  | 		 * implies that pages are cycling through the LRU faster than | 
					
						
							|  |  |  | 		 * they are written so also forcibly stall. | 
					
						
							|  |  |  | 		 */ | 
					
						
							|  |  |  | 		if (nr_unqueued_dirty == nr_taken || nr_immediate) | 
					
						
							|  |  |  | 			congestion_wait(BLK_RW_ASYNC, HZ/10); | 
					
						
							| 
									
										
											  
											
												mm: vmscan: stall page reclaim and writeback pages based on dirty/writepage pages encountered
Further testing of the "Reduce system disruption due to kswapd"
discovered a few problems.  First and foremost, it's possible for pages
under writeback to be freed which will lead to badness.  Second, as
pages were not being swapped the file LRU was being scanned faster and
clean file pages were being reclaimed.  In some cases this results in
increased read IO to re-read data from disk.  Third, more pages were
being written from kswapd context which can adversly affect IO
performance.  Lastly, it was observed that PageDirty pages are not
necessarily dirty on all filesystems (buffers can be clean while
PageDirty is set and ->writepage generates no IO) and not all
filesystems set PageWriteback when the page is being written (e.g.
ext3).  This disconnect confuses the reclaim stalling logic.  This
follow-up series is aimed at these problems.
The tests were based on three kernels
vanilla:	kernel 3.9 as that is what the current mmotm uses as a baseline
mmotm-20130522	is mmotm as of 22nd May with "Reduce system disruption due to
		kswapd" applied on top as per what should be in Andrew's tree
		right now
lessdisrupt-v7r10 is this follow-up series on top of the mmotm kernel
The first test used memcached+memcachetest while some background IO was
in progress as implemented by the parallel IO tests implement in MM
Tests.  memcachetest benchmarks how many operations/second memcached can
service.  It starts with no background IO on a freshly created ext4
filesystem and then re-runs the test with larger amounts of IO in the
background to roughly simulate a large copy in progress.  The
expectation is that the IO should have little or no impact on
memcachetest which is running entirely in memory.
parallelio
                                             3.9.0                       3.9.0                       3.9.0
                                           vanilla          mm1-mmotm-20130522       mm1-lessdisrupt-v7r10
Ops memcachetest-0M             23117.00 (  0.00%)          22780.00 ( -1.46%)          22763.00 ( -1.53%)
Ops memcachetest-715M           23774.00 (  0.00%)          23299.00 ( -2.00%)          22934.00 ( -3.53%)
Ops memcachetest-2385M           4208.00 (  0.00%)          24154.00 (474.00%)          23765.00 (464.76%)
Ops memcachetest-4055M           4104.00 (  0.00%)          25130.00 (512.33%)          24614.00 (499.76%)
Ops io-duration-0M                  0.00 (  0.00%)              0.00 (  0.00%)              0.00 (  0.00%)
Ops io-duration-715M               12.00 (  0.00%)              7.00 ( 41.67%)              6.00 ( 50.00%)
Ops io-duration-2385M             116.00 (  0.00%)             21.00 ( 81.90%)             21.00 ( 81.90%)
Ops io-duration-4055M             160.00 (  0.00%)             36.00 ( 77.50%)             35.00 ( 78.12%)
Ops swaptotal-0M                    0.00 (  0.00%)              0.00 (  0.00%)              0.00 (  0.00%)
Ops swaptotal-715M             140138.00 (  0.00%)             18.00 ( 99.99%)             18.00 ( 99.99%)
Ops swaptotal-2385M            385682.00 (  0.00%)              0.00 (  0.00%)              0.00 (  0.00%)
Ops swaptotal-4055M            418029.00 (  0.00%)              0.00 (  0.00%)              0.00 (  0.00%)
Ops swapin-0M                       0.00 (  0.00%)              0.00 (  0.00%)              0.00 (  0.00%)
Ops swapin-715M                   144.00 (  0.00%)              0.00 (  0.00%)              0.00 (  0.00%)
Ops swapin-2385M               134227.00 (  0.00%)              0.00 (  0.00%)              0.00 (  0.00%)
Ops swapin-4055M               125618.00 (  0.00%)              0.00 (  0.00%)              0.00 (  0.00%)
Ops minorfaults-0M            1536429.00 (  0.00%)        1531632.00 (  0.31%)        1533541.00 (  0.19%)
Ops minorfaults-715M          1786996.00 (  0.00%)        1612148.00 (  9.78%)        1608832.00 (  9.97%)
Ops minorfaults-2385M         1757952.00 (  0.00%)        1614874.00 (  8.14%)        1613541.00 (  8.21%)
Ops minorfaults-4055M         1774460.00 (  0.00%)        1633400.00 (  7.95%)        1630881.00 (  8.09%)
Ops majorfaults-0M                  1.00 (  0.00%)              0.00 (  0.00%)              0.00 (  0.00%)
Ops majorfaults-715M              184.00 (  0.00%)            167.00 (  9.24%)            166.00 (  9.78%)
Ops majorfaults-2385M           24444.00 (  0.00%)            155.00 ( 99.37%)             93.00 ( 99.62%)
Ops majorfaults-4055M           21357.00 (  0.00%)            147.00 ( 99.31%)            134.00 ( 99.37%)
memcachetest is the transactions/second reported by memcachetest. In
        the vanilla kernel note that performance drops from around
        23K/sec to just over 4K/second when there is 2385M of IO going
        on in the background. With current mmotm, there is no collapse
	in performance and with this follow-up series there is little
	change.
swaptotal is the total amount of swap traffic. With mmotm and the follow-up
	series, the total amount of swapping is much reduced.
                                 3.9.0       3.9.0       3.9.0
                               vanillamm1-mmotm-20130522mm1-lessdisrupt-v7r10
Minor Faults                  11160152    10706748    10622316
Major Faults                     46305         755         678
Swap Ins                        260249           0           0
Swap Outs                       683860          18          18
Direct pages scanned                 0         678        2520
Kswapd pages scanned           6046108     8814900     1639279
Kswapd pages reclaimed         1081954     1172267     1094635
Direct pages reclaimed               0         566        2304
Kswapd efficiency                  17%         13%         66%
Kswapd velocity               5217.560    7618.953    1414.879
Direct efficiency                 100%         83%         91%
Direct velocity                  0.000       0.586       2.175
Percentage direct scans             0%          0%          0%
Zone normal velocity          5105.086    6824.681     671.158
Zone dma32 velocity            112.473     794.858     745.896
Zone dma velocity                0.000       0.000       0.000
Page writes by reclaim     1929612.000 6861768.000   32821.000
Page writes file               1245752     6861750       32803
Page writes anon                683860          18          18
Page reclaim immediate            7484          40         239
Sector Reads                   1130320       93996       86900
Sector Writes                 13508052    10823500    11804436
Page rescued immediate               0           0           0
Slabs scanned                    33536       27136       18560
Direct inode steals                  0           0           0
Kswapd inode steals               8641        1035           0
Kswapd skipped wait                  0           0           0
THP fault alloc                      8          37          33
THP collapse alloc                 508         552         515
THP splits                          24           1           1
THP fault fallback                   0           0           0
THP collapse fail                    0           0           0
There are a number of observations to make here
1. Swap outs are almost eliminated. Swap ins are 0 indicating that the
   pages swapped were really unused anonymous pages. Related to that,
   major faults are much reduced.
2. kswapd efficiency was impacted by the initial series but with these
   follow-up patches, the efficiency is now at 66% indicating that far
   fewer pages were skipped during scanning due to dirty or writeback
   pages.
3. kswapd velocity is reduced indicating that fewer pages are being scanned
   with the follow-up series as kswapd now stalls when the tail of the
   LRU queue is full of unqueued dirty pages. The stall gives flushers a
   chance to catch-up so kswapd can reclaim clean pages when it wakes
4. In light of Zlatko's recent reports about zone scanning imbalances,
   mmtests now reports scanning velocity on a per-zone basis. With mainline,
   you can see that the scanning activity is dominated by the Normal
   zone with over 45 times more scanning in Normal than the DMA32 zone.
   With the series currently in mmotm, the ratio is slightly better but it
   is still the case that the bulk of scanning is in the highest zone. With
   this follow-up series, the ratio of scanning between the Normal and
   DMA32 zone is roughly equal.
5. As Dave Chinner observed, the current patches in mmotm increased the
   number of pages written from kswapd context which is expected to adversly
   impact IO performance. With the follow-up patches, far fewer pages are
   written from kswapd context than the mainline kernel
6. With the series in mmotm, fewer inodes were reclaimed by kswapd. With
   the follow-up series, there is less slab shrinking activity and no inodes
   were reclaimed.
7. Note that "Sectors Read" is drastically reduced implying that the source
   data being used for the IO is not being aggressively discarded due to
   page reclaim skipping over dirty pages and reclaiming clean pages. Note
   that the reducion in reads could also be due to inode data not being
   re-read from disk after a slab shrink.
                       3.9.0       3.9.0       3.9.0
                     vanillamm1-mmotm-20130522mm1-lessdisrupt-v7r10
Mean sda-avgqz        166.99       32.09       33.44
Mean sda-await        853.64      192.76      185.43
Mean sda-r_await        6.31        9.24        5.97
Mean sda-w_await     2992.81      202.65      192.43
Max  sda-avgqz       1409.91      718.75      698.98
Max  sda-await       6665.74     3538.00     3124.23
Max  sda-r_await       58.96      111.95       58.00
Max  sda-w_await    28458.94     3977.29     3148.61
In light of the changes in writes from reclaim context, the number of
reads and Dave Chinner's concerns about IO performance I took a closer
look at the IO stats for the test disk. Few observations
1. The average queue size is reduced by the initial series and roughly
   the same with this follow up.
2. Average wait times for writes are reduced and as the IO
   is completing faster it at least implies that the gain is because
   flushers are writing the files efficiently instead of page reclaim
   getting in the way.
3. The reduction in maximum write latency is staggering. 28 seconds down
   to 3 seconds.
Jan Kara asked how NFS is affected by all of this. Unstable pages can
be taken into account as one of the patches in the series shows but it
is still the case that filesystems with unusual handling of dirty or
writeback could still be treated better.
Tests like postmark, fsmark and largedd showed up nothing useful. On my test
setup, pages are simply not being written back from reclaim context with or
without the patches and there are no changes in performance. My test setup
probably is just not strong enough network-wise to be really interesting.
I ran a longer-lived memcached test with IO going to NFS instead of a local disk
parallelio
                                             3.9.0                       3.9.0                       3.9.0
                                           vanilla          mm1-mmotm-20130522       mm1-lessdisrupt-v7r10
Ops memcachetest-0M             23323.00 (  0.00%)          23241.00 ( -0.35%)          23321.00 ( -0.01%)
Ops memcachetest-715M           25526.00 (  0.00%)          24763.00 ( -2.99%)          23242.00 ( -8.95%)
Ops memcachetest-2385M           8814.00 (  0.00%)          26924.00 (205.47%)          23521.00 (166.86%)
Ops memcachetest-4055M           5835.00 (  0.00%)          26827.00 (359.76%)          25560.00 (338.05%)
Ops io-duration-0M                  0.00 (  0.00%)              0.00 (  0.00%)              0.00 (  0.00%)
Ops io-duration-715M               65.00 (  0.00%)             71.00 ( -9.23%)             11.00 ( 83.08%)
Ops io-duration-2385M             129.00 (  0.00%)             94.00 ( 27.13%)             53.00 ( 58.91%)
Ops io-duration-4055M             301.00 (  0.00%)            100.00 ( 66.78%)            108.00 ( 64.12%)
Ops swaptotal-0M                    0.00 (  0.00%)              0.00 (  0.00%)              0.00 (  0.00%)
Ops swaptotal-715M              14394.00 (  0.00%)            949.00 ( 93.41%)             63.00 ( 99.56%)
Ops swaptotal-2385M            401483.00 (  0.00%)          24437.00 ( 93.91%)          30118.00 ( 92.50%)
Ops swaptotal-4055M            554123.00 (  0.00%)          35688.00 ( 93.56%)          63082.00 ( 88.62%)
Ops swapin-0M                       0.00 (  0.00%)              0.00 (  0.00%)              0.00 (  0.00%)
Ops swapin-715M                  4522.00 (  0.00%)            560.00 ( 87.62%)             63.00 ( 98.61%)
Ops swapin-2385M               169861.00 (  0.00%)           5026.00 ( 97.04%)          13917.00 ( 91.81%)
Ops swapin-4055M               192374.00 (  0.00%)          10056.00 ( 94.77%)          25729.00 ( 86.63%)
Ops minorfaults-0M            1445969.00 (  0.00%)        1520878.00 ( -5.18%)        1454024.00 ( -0.56%)
Ops minorfaults-715M          1557288.00 (  0.00%)        1528482.00 (  1.85%)        1535776.00 (  1.38%)
Ops minorfaults-2385M         1692896.00 (  0.00%)        1570523.00 (  7.23%)        1559622.00 (  7.87%)
Ops minorfaults-4055M         1654985.00 (  0.00%)        1581456.00 (  4.44%)        1596713.00 (  3.52%)
Ops majorfaults-0M                  0.00 (  0.00%)              1.00 (-99.00%)              0.00 (  0.00%)
Ops majorfaults-715M              763.00 (  0.00%)            265.00 ( 65.27%)             75.00 ( 90.17%)
Ops majorfaults-2385M           23861.00 (  0.00%)            894.00 ( 96.25%)           2189.00 ( 90.83%)
Ops majorfaults-4055M           27210.00 (  0.00%)           1569.00 ( 94.23%)           4088.00 ( 84.98%)
1. Performance does not collapse due to IO which is good. IO is also completing
   faster. Note with mmotm, IO completes in a third of the time and faster again
   with this series applied
2. Swapping is reduced, although not eliminated. The figures for the follow-up
   look bad but it does vary a bit as the stalling is not perfect for nfs
   or filesystems like ext3 with unusual handling of dirty and writeback
   pages
3. There are swapins, particularly with larger amounts of IO indicating
   that active pages are being reclaimed. However, the number of much
   reduced.
                                 3.9.0       3.9.0       3.9.0
                               vanillamm1-mmotm-20130522mm1-lessdisrupt-v7r10
Minor Faults                  36339175    35025445    35219699
Major Faults                    310964       27108       51887
Swap Ins                       2176399      173069      333316
Swap Outs                      3344050      357228      504824
Direct pages scanned              8972       77283       43242
Kswapd pages scanned          20899983     8939566    14772851
Kswapd pages reclaimed         6193156     5172605     5231026
Direct pages reclaimed            8450       73802       39514
Kswapd efficiency                  29%         57%         35%
Kswapd velocity               3929.743    1847.499    3058.840
Direct efficiency                  94%         95%         91%
Direct velocity                  1.687      15.972       8.954
Percentage direct scans             0%          0%          0%
Zone normal velocity          3721.907     939.103    2185.142
Zone dma32 velocity            209.522     924.368     882.651
Zone dma velocity                0.000       0.000       0.000
Page writes by reclaim     4082185.000  526319.000  537114.000
Page writes file                738135      169091       32290
Page writes anon               3344050      357228      504824
Page reclaim immediate            9524         170     5595843
Sector Reads                   8909900      861192     1483680
Sector Writes                 13428980     1488744     2076800
Page rescued immediate               0           0           0
Slabs scanned                    38016       31744       28672
Direct inode steals                  0           0           0
Kswapd inode steals                424           0           0
Kswapd skipped wait                  0           0           0
THP fault alloc                     14          15         119
THP collapse alloc                1767        1569        1618
THP splits                          30          29          25
THP fault fallback                   0           0           0
THP collapse fail                    8           5           0
Compaction stalls                   17          41         100
Compaction success                   7          31          95
Compaction failures                 10          10           5
Page migrate success              7083       22157       62217
Page migrate failure                 0           0           0
Compaction pages isolated        14847       48758      135830
Compaction migrate scanned       18328       48398      138929
Compaction free scanned        2000255      355827     1720269
Compaction cost                      7          24          68
I guess the main takeaway again is the much reduced page writes
from reclaim context and reduced reads.
                       3.9.0       3.9.0       3.9.0
                     vanillamm1-mmotm-20130522mm1-lessdisrupt-v7r10
Mean sda-avgqz         23.58        0.35        0.44
Mean sda-await        133.47       15.72       15.46
Mean sda-r_await        4.72        4.69        3.95
Mean sda-w_await      507.69       28.40       33.68
Max  sda-avgqz        680.60       12.25       23.14
Max  sda-await       3958.89      221.83      286.22
Max  sda-r_await       63.86       61.23       67.29
Max  sda-w_await    11710.38      883.57     1767.28
And as before, write wait times are much reduced.
This patch:
The patch "mm: vmscan: Have kswapd writeback pages based on dirty pages
encountered, not priority" decides whether to writeback pages from reclaim
context based on the number of dirty pages encountered.  This situation is
flagged too easily and flushers are not given the chance to catch up
resulting in more pages being written from reclaim context and potentially
impacting IO performance.  The check for PageWriteback is also misplaced
as it happens within a PageDirty check which is nonsense as the dirty may
have been cleared for IO.  The accounting is updated very late and pages
that are already under writeback, were reactivated, could not unmapped or
could not be released are all missed.  Similarly, a page is considered
congested for reasons other than being congested and pages that cannot be
written out in the correct context are skipped.  Finally, it considers
stalling and writing back filesystem pages due to encountering dirty
anonymous pages at the tail of the LRU which is dumb.
This patch causes kswapd to begin writing filesystem pages from reclaim
context only if page reclaim found that all filesystem pages at the tail
of the LRU were unqueued dirty pages.  Before it starts writing filesystem
pages, it will stall to give flushers a chance to catch up.  The decision
on whether wait_iff_congested is also now determined by dirty filesystem
pages only.  Congested pages are based on whether the underlying BDI is
congested regardless of the context of the reclaiming process.
Signed-off-by: Mel Gorman <mgorman@suse.de>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Michal Hocko <mhocko@suse.cz>
Cc: Rik van Riel <riel@redhat.com>
Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Jiri Slaby <jslaby@suse.cz>
Cc: Valdis Kletnieks <Valdis.Kletnieks@vt.edu>
Cc: Zlatko Calusic <zcalusic@bitsync.net>
Cc: dormando <dormando@rydia.net>
Cc: Trond Myklebust <trond.myklebust@fys.uio.no>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
											
										 
											2013-07-03 15:01:57 -07:00
										 |  |  | 	} | 
					
						
							| 
									
										
										
										
											2013-07-03 15:01:50 -07:00
										 |  |  | 
 | 
					
						
							| 
									
										
										
										
											2013-07-03 15:02:02 -07:00
										 |  |  | 	/*
 | 
					
						
							|  |  |  | 	 * Stall direct reclaim for IO completions if underlying BDIs or zone | 
					
						
							|  |  |  | 	 * is congested. Allow kswapd to continue until it starts encountering | 
					
						
							|  |  |  | 	 * unqueued dirty pages or cycling through the LRU too quickly. | 
					
						
							|  |  |  | 	 */ | 
					
						
							|  |  |  | 	if (!sc->hibernation_mode && !current_is_kswapd()) | 
					
						
							|  |  |  | 		wait_iff_congested(zone, BLK_RW_ASYNC, HZ/10); | 
					
						
							|  |  |  | 
 | 
					
						
							| 
									
										
											  
											
												tracing, vmscan: add trace events for LRU list shrinking
There have been numerous reports of stalls that pointed at the problem
being somewhere in the VM.  There are multiple roots to the problems which
means dealing with any of the root problems in isolation is tricky to
justify on their own and they would still need integration testing.  This
patch series puts together two different patch sets which in combination
should tackle some of the root causes of latency problems being reported.
Patch 1 adds a tracepoint for shrink_inactive_list.  For this series, the
most important results is being able to calculate the scanning/reclaim
ratio as a measure of the amount of work being done by page reclaim.
Patch 2 accounts for time spent in congestion_wait.
Patches 3-6 were originally developed by Kosaki Motohiro but reworked for
this series.  It has been noted that lumpy reclaim is far too aggressive
and trashes the system somewhat.  As SLUB uses high-order allocations, a
large cost incurred by lumpy reclaim will be noticeable.  It was also
reported during transparent hugepage support testing that lumpy reclaim
was trashing the system and these patches should mitigate that problem
without disabling lumpy reclaim.
Patch 7 adds wait_iff_congested() and replaces some callers of
congestion_wait().  wait_iff_congested() only sleeps if there is a BDI
that is currently congested.  Patch 8 notes that any BDI being congested
is not necessarily a problem because there could be multiple BDIs of
varying speeds and numberous zones.  It attempts to track when a zone
being reclaimed contains many pages backed by a congested BDI and if so,
reclaimers wait on the congestion queue.
I ran a number of tests with monitoring on X86, X86-64 and PPC64. Each
machine had 3G of RAM and the CPUs were
X86:    Intel P4 2-core
X86-64: AMD Phenom 4-core
PPC64:  PPC970MP
Each used a single disk and the onboard IO controller.  Dirty ratio was
left at 20.  I'm just going to report for X86-64 and PPC64 in a vague
attempt to keep this report short.  Four kernels were tested each based on
v2.6.36-rc4
traceonly-v2r2:     Patches 1 and 2 to instrument vmscan reclaims and congestion_wait
lowlumpy-v2r3:      Patches 1-6 to test if lumpy reclaim is better
waitcongest-v2r3:   Patches 1-7 to only wait on congestion
waitwriteback-v2r4: Patches 1-8 to detect when a zone is congested
nocongest-v1r5: Patches 1-3 for testing wait_iff_congestion
nodirect-v1r5:  Patches 1-10 to disable filesystem writeback for better IO
The tests run were as follows
kernbench
	compile-based benchmark. Smoke test performance
sysbench
	OLTP read-only benchmark. Will be re-run in the future as read-write
micro-mapped-file-stream
	This is a micro-benchmark from Johannes Weiner that accesses a
	large sparse-file through mmap(). It was configured to run in only
	single-CPU mode but can be indicative of how well page reclaim
	identifies suitable pages.
stress-highalloc
	Tries to allocate huge pages under heavy load.
kernbench, iozone and sysbench did not report any performance regression
on any machine.  sysbench did pressure the system lightly and there was
reclaim activity but there were no difference of major interest between
the kernels.
X86-64 micro-mapped-file-stream
                                      traceonly-v2r2           lowlumpy-v2r3        waitcongest-v2r3     waitwriteback-v2r4
pgalloc_dma                       1639.00 (   0.00%)       667.00 (-145.73%)      1167.00 ( -40.45%)       578.00 (-183.56%)
pgalloc_dma32                  2842410.00 (   0.00%)   2842626.00 (   0.01%)   2843043.00 (   0.02%)   2843014.00 (   0.02%)
pgalloc_normal                       0.00 (   0.00%)         0.00 (   0.00%)         0.00 (   0.00%)         0.00 (   0.00%)
pgsteal_dma                        729.00 (   0.00%)        85.00 (-757.65%)       609.00 ( -19.70%)       125.00 (-483.20%)
pgsteal_dma32                  2338721.00 (   0.00%)   2447354.00 (   4.44%)   2429536.00 (   3.74%)   2436772.00 (   4.02%)
pgsteal_normal                       0.00 (   0.00%)         0.00 (   0.00%)         0.00 (   0.00%)         0.00 (   0.00%)
pgscan_kswapd_dma                 1469.00 (   0.00%)       532.00 (-176.13%)      1078.00 ( -36.27%)       220.00 (-567.73%)
pgscan_kswapd_dma32            4597713.00 (   0.00%)   4503597.00 (  -2.09%)   4295673.00 (  -7.03%)   3891686.00 ( -18.14%)
pgscan_kswapd_normal                 0.00 (   0.00%)         0.00 (   0.00%)         0.00 (   0.00%)         0.00 (   0.00%)
pgscan_direct_dma                   71.00 (   0.00%)       134.00 (  47.01%)       243.00 (  70.78%)       352.00 (  79.83%)
pgscan_direct_dma32             305820.00 (   0.00%)    280204.00 (  -9.14%)    600518.00 (  49.07%)    957485.00 (  68.06%)
pgscan_direct_normal                 0.00 (   0.00%)         0.00 (   0.00%)         0.00 (   0.00%)         0.00 (   0.00%)
pageoutrun                       16296.00 (   0.00%)     21254.00 (  23.33%)     18447.00 (  11.66%)     20067.00 (  18.79%)
allocstall                         443.00 (   0.00%)       273.00 ( -62.27%)       513.00 (  13.65%)      1568.00 (  71.75%)
These are based on the raw figures taken from /proc/vmstat.  It's a rough
measure of reclaim activity.  Note that allocstall counts are higher
because we are entering direct reclaim more often as a result of not
sleeping in congestion.  In itself, it's not necessarily a bad thing.
It's easier to get a view of what happened from the vmscan tracepoint
report.
FTrace Reclaim Statistics: vmscan
                                traceonly-v2r2   lowlumpy-v2r3 waitcongest-v2r3 waitwriteback-v2r4
Direct reclaims                                443        273        513       1568
Direct reclaim pages scanned                305968     280402     600825     957933
Direct reclaim pages reclaimed               43503      19005      30327     117191
Direct reclaim write file async I/O              0          0          0          0
Direct reclaim write anon async I/O              0          3          4         12
Direct reclaim write file sync I/O               0          0          0          0
Direct reclaim write anon sync I/O               0          0          0          0
Wake kswapd requests                        187649     132338     191695     267701
Kswapd wakeups                                   3          1          4          1
Kswapd pages scanned                       4599269    4454162    4296815    3891906
Kswapd pages reclaimed                     2295947    2428434    2399818    2319706
Kswapd reclaim write file async I/O              1          0          1          1
Kswapd reclaim write anon async I/O             59        187         41        222
Kswapd reclaim write file sync I/O               0          0          0          0
Kswapd reclaim write anon sync I/O               0          0          0          0
Time stalled direct reclaim (seconds)         4.34       2.52       6.63       2.96
Time kswapd awake (seconds)                  11.15      10.25      11.01      10.19
Total pages scanned                        4905237   4734564   4897640   4849839
Total pages reclaimed                      2339450   2447439   2430145   2436897
%age total pages scanned/reclaimed          47.69%    51.69%    49.62%    50.25%
%age total pages scanned/written             0.00%     0.00%     0.00%     0.00%
%age  file pages scanned/written             0.00%     0.00%     0.00%     0.00%
Percentage Time Spent Direct Reclaim        29.23%    19.02%    38.48%    20.25%
Percentage Time kswapd Awake                78.58%    78.85%    76.83%    79.86%
What is interesting here for nocongest in particular is that while direct
reclaim scans more pages, the overall number of pages scanned remains the
same and the ratio of pages scanned to pages reclaimed is more or less the
same.  In other words, while we are sleeping less, reclaim is not doing
more work and as direct reclaim and kswapd is awake for less time, it
would appear to be doing less work.
FTrace Reclaim Statistics: congestion_wait
Direct number congest     waited                87        196         64          0
Direct time   congest     waited            4604ms     4732ms     5420ms        0ms
Direct full   congest     waited                72        145         53          0
Direct number conditional waited                 0          0        324       1315
Direct time   conditional waited               0ms        0ms        0ms        0ms
Direct full   conditional waited                 0          0          0          0
KSwapd number congest     waited                20         10         15          7
KSwapd time   congest     waited            1264ms      536ms      884ms      284ms
KSwapd full   congest     waited                10          4          6          2
KSwapd number conditional waited                 0          0          0          0
KSwapd time   conditional waited               0ms        0ms        0ms        0ms
KSwapd full   conditional waited                 0          0          0          0
The vanilla kernel spent 8 seconds asleep in direct reclaim and no time at
all asleep with the patches.
MMTests Statistics: duration
User/Sys Time Running Test (seconds)         10.51     10.73      10.6     11.66
Total Elapsed Time (seconds)                 14.19     13.00     14.33     12.76
Overall, the tests completed faster. It is interesting to note that backing off further
when a zone is congested and not just a BDI was more efficient overall.
PPC64 micro-mapped-file-stream
pgalloc_dma                    3024660.00 (   0.00%)   3027185.00 (   0.08%)   3025845.00 (   0.04%)   3026281.00 (   0.05%)
pgalloc_normal                       0.00 (   0.00%)         0.00 (   0.00%)         0.00 (   0.00%)         0.00 (   0.00%)
pgsteal_dma                    2508073.00 (   0.00%)   2565351.00 (   2.23%)   2463577.00 (  -1.81%)   2532263.00 (   0.96%)
pgsteal_normal                       0.00 (   0.00%)         0.00 (   0.00%)         0.00 (   0.00%)         0.00 (   0.00%)
pgscan_kswapd_dma              4601307.00 (   0.00%)   4128076.00 ( -11.46%)   3912317.00 ( -17.61%)   3377165.00 ( -36.25%)
pgscan_kswapd_normal                 0.00 (   0.00%)         0.00 (   0.00%)         0.00 (   0.00%)         0.00 (   0.00%)
pgscan_direct_dma               629825.00 (   0.00%)    971622.00 (  35.18%)   1063938.00 (  40.80%)   1711935.00 (  63.21%)
pgscan_direct_normal                 0.00 (   0.00%)         0.00 (   0.00%)         0.00 (   0.00%)         0.00 (   0.00%)
pageoutrun                       27776.00 (   0.00%)     20458.00 ( -35.77%)     18763.00 ( -48.04%)     18157.00 ( -52.98%)
allocstall                         977.00 (   0.00%)      2751.00 (  64.49%)      2098.00 (  53.43%)      5136.00 (  80.98%)
Similar trends to x86-64. allocstalls are up but it's not necessarily bad.
FTrace Reclaim Statistics: vmscan
Direct reclaims                                977       2709       2098       5136
Direct reclaim pages scanned                629825     963814    1063938    1711935
Direct reclaim pages reclaimed               75550     242538     150904     387647
Direct reclaim write file async I/O              0          0          0          2
Direct reclaim write anon async I/O              0         10          0          4
Direct reclaim write file sync I/O               0          0          0          0
Direct reclaim write anon sync I/O               0          0          0          0
Wake kswapd requests                        392119    1201712     571935     571921
Kswapd wakeups                                   3          2          3          3
Kswapd pages scanned                       4601307    4128076    3912317    3377165
Kswapd pages reclaimed                     2432523    2318797    2312673    2144616
Kswapd reclaim write file async I/O             20          1          1          1
Kswapd reclaim write anon async I/O             57        132         11        121
Kswapd reclaim write file sync I/O               0          0          0          0
Kswapd reclaim write anon sync I/O               0          0          0          0
Time stalled direct reclaim (seconds)         6.19       7.30      13.04      10.88
Time kswapd awake (seconds)                  21.73      26.51      25.55      23.90
Total pages scanned                        5231132   5091890   4976255   5089100
Total pages reclaimed                      2508073   2561335   2463577   2532263
%age total pages scanned/reclaimed          47.95%    50.30%    49.51%    49.76%
%age total pages scanned/written             0.00%     0.00%     0.00%     0.00%
%age  file pages scanned/written             0.00%     0.00%     0.00%     0.00%
Percentage Time Spent Direct Reclaim        18.89%    20.65%    32.65%    27.65%
Percentage Time kswapd Awake                72.39%    80.68%    78.21%    77.40%
Again, a similar trend that the congestion_wait changes mean that direct
reclaim scans more pages but the overall number of pages scanned while
slightly reduced, are very similar.  The ratio of scanning/reclaimed
remains roughly similar.  The downside is that kswapd and direct reclaim
was awake longer and for a larger percentage of the overall workload.
It's possible there were big differences in the amount of time spent
reclaiming slab pages between the different kernels which is plausible
considering that the micro tests runs after fsmark and sysbench.
Trace Reclaim Statistics: congestion_wait
Direct number congest     waited               845       1312        104          0
Direct time   congest     waited           19416ms    26560ms     7544ms        0ms
Direct full   congest     waited               745       1105         72          0
Direct number conditional waited                 0          0       1322       2935
Direct time   conditional waited               0ms        0ms       12ms      312ms
Direct full   conditional waited                 0          0          0          3
KSwapd number congest     waited                39        102         75         63
KSwapd time   congest     waited            2484ms     6760ms     5756ms     3716ms
KSwapd full   congest     waited                20         48         46         25
KSwapd number conditional waited                 0          0          0          0
KSwapd time   conditional waited               0ms        0ms        0ms        0ms
KSwapd full   conditional waited                 0          0          0          0
The vanilla kernel spent 20 seconds asleep in direct reclaim and only
312ms asleep with the patches.  The time kswapd spent congest waited was
also reduced by a large factor.
MMTests Statistics: duration
ser/Sys Time Running Test (seconds)         26.58     28.05      26.9     28.47
Total Elapsed Time (seconds)                 30.02     32.86     32.67     30.88
With all patches applies, the completion times are very similar.
X86-64 STRESS-HIGHALLOC
                traceonly-v2r2     lowlumpy-v2r3  waitcongest-v2r3waitwriteback-v2r4
Pass 1          82.00 ( 0.00%)    84.00 ( 2.00%)    85.00 ( 3.00%)    85.00 ( 3.00%)
Pass 2          90.00 ( 0.00%)    87.00 (-3.00%)    88.00 (-2.00%)    89.00 (-1.00%)
At Rest         92.00 ( 0.00%)    90.00 (-2.00%)    90.00 (-2.00%)    91.00 (-1.00%)
Success figures across the board are broadly similar.
                traceonly-v2r2     lowlumpy-v2r3  waitcongest-v2r3waitwriteback-v2r4
Direct reclaims                               1045        944        886        887
Direct reclaim pages scanned                135091     119604     109382     101019
Direct reclaim pages reclaimed               88599      47535      47863      46671
Direct reclaim write file async I/O            494        283        465        280
Direct reclaim write anon async I/O          29357      13710      16656      13462
Direct reclaim write file sync I/O             154          2          2          3
Direct reclaim write anon sync I/O           14594        571        509        561
Wake kswapd requests                          7491        933        872        892
Kswapd wakeups                                 814        778        731        780
Kswapd pages scanned                       7290822   15341158   11916436   13703442
Kswapd pages reclaimed                     3587336    3142496    3094392    3187151
Kswapd reclaim write file async I/O          91975      32317      28022      29628
Kswapd reclaim write anon async I/O        1992022     789307     829745     849769
Kswapd reclaim write file sync I/O               0          0          0          0
Kswapd reclaim write anon sync I/O               0          0          0          0
Time stalled direct reclaim (seconds)      4588.93    2467.16    2495.41    2547.07
Time kswapd awake (seconds)                2497.66    1020.16    1098.06    1176.82
Total pages scanned                        7425913  15460762  12025818  13804461
Total pages reclaimed                      3675935   3190031   3142255   3233822
%age total pages scanned/reclaimed          49.50%    20.63%    26.13%    23.43%
%age total pages scanned/written            28.66%     5.41%     7.28%     6.47%
%age  file pages scanned/written             1.25%     0.21%     0.24%     0.22%
Percentage Time Spent Direct Reclaim        57.33%    42.15%    42.41%    42.99%
Percentage Time kswapd Awake                43.56%    27.87%    29.76%    31.25%
Scanned/reclaimed ratios again look good with big improvements in
efficiency.  The Scanned/written ratios also look much improved.  With a
better scanned/written ration, there is an expectation that IO would be
more efficient and indeed, the time spent in direct reclaim is much
reduced by the full series and kswapd spends a little less time awake.
Overall, indications here are that allocations were happening much faster
and this can be seen with a graph of the latency figures as the
allocations were taking place
http://www.csn.ul.ie/~mel/postings/vmscanreduce-20101509/highalloc-interlatency-hydra-mean.ps
FTrace Reclaim Statistics: congestion_wait
Direct number congest     waited              1333        204        169          4
Direct time   congest     waited           78896ms     8288ms     7260ms      200ms
Direct full   congest     waited               756         92         69          2
Direct number conditional waited                 0          0         26        186
Direct time   conditional waited               0ms        0ms        0ms     2504ms
Direct full   conditional waited                 0          0          0         25
KSwapd number congest     waited                 4        395        227        282
KSwapd time   congest     waited             384ms    25136ms    10508ms    18380ms
KSwapd full   congest     waited                 3        232         98        176
KSwapd number conditional waited                 0          0          0          0
KSwapd time   conditional waited               0ms        0ms        0ms        0ms
KSwapd full   conditional waited                 0          0          0          0
KSwapd full   conditional waited               318          0        312          9
Overall, the time spent speeping is reduced.  kswapd is still hitting
congestion_wait() but that is because there are callers remaining where it
wasn't clear in advance if they should be changed to wait_iff_congested()
or not.  Overall the sleep imes are reduced though - from 79ish seconds to
about 19.
MMTests Statistics: duration
User/Sys Time Running Test (seconds)       3415.43   3386.65   3388.39    3377.5
Total Elapsed Time (seconds)               5733.48   3660.33   3689.41   3765.39
With the full series, the time to complete the tests are reduced by 30%
PPC64 STRESS-HIGHALLOC
                traceonly-v2r2     lowlumpy-v2r3  waitcongest-v2r3waitwriteback-v2r4
Pass 1          17.00 ( 0.00%)    34.00 (17.00%)    38.00 (21.00%)    43.00 (26.00%)
Pass 2          25.00 ( 0.00%)    37.00 (12.00%)    42.00 (17.00%)    46.00 (21.00%)
At Rest         49.00 ( 0.00%)    43.00 (-6.00%)    45.00 (-4.00%)    51.00 ( 2.00%)
Success rates there are *way* up particularly considering that the 16MB
huge pages on PPC64 mean that it's always much harder to allocate them.
FTrace Reclaim Statistics: vmscan
              stress-highalloc  stress-highalloc  stress-highalloc  stress-highalloc
                traceonly-v2r2     lowlumpy-v2r3  waitcongest-v2r3waitwriteback-v2r4
Direct reclaims                                499        505        564        509
Direct reclaim pages scanned                223478      41898      51818      45605
Direct reclaim pages reclaimed              137730      21148      27161      23455
Direct reclaim write file async I/O            399        136        162        136
Direct reclaim write anon async I/O          46977       2865       4686       3998
Direct reclaim write file sync I/O              29          0          1          3
Direct reclaim write anon sync I/O           31023        159        237        239
Wake kswapd requests                           420        351        360        326
Kswapd wakeups                                 185        294        249        277
Kswapd pages scanned                      15703488   16392500   17821724   17598737
Kswapd pages reclaimed                     5808466    2908858    3139386    3145435
Kswapd reclaim write file async I/O         159938      18400      18717      13473
Kswapd reclaim write anon async I/O        3467554     228957     322799     234278
Kswapd reclaim write file sync I/O               0          0          0          0
Kswapd reclaim write anon sync I/O               0          0          0          0
Time stalled direct reclaim (seconds)      9665.35    1707.81    2374.32    1871.23
Time kswapd awake (seconds)                9401.21    1367.86    1951.75    1328.88
Total pages scanned                       15926966  16434398  17873542  17644342
Total pages reclaimed                      5946196   2930006   3166547   3168890
%age total pages scanned/reclaimed          37.33%    17.83%    17.72%    17.96%
%age total pages scanned/written            23.27%     1.52%     1.94%     1.43%
%age  file pages scanned/written             1.01%     0.11%     0.11%     0.08%
Percentage Time Spent Direct Reclaim        44.55%    35.10%    41.42%    36.91%
Percentage Time kswapd Awake                86.71%    43.58%    52.67%    41.14%
While the scanning rates are slightly up, the scanned/reclaimed and
scanned/written figures are much improved.  The time spent in direct
reclaim and with kswapd are massively reduced, mostly by the lowlumpy
patches.
FTrace Reclaim Statistics: congestion_wait
Direct number congest     waited               725        303        126          3
Direct time   congest     waited           45524ms     9180ms     5936ms      300ms
Direct full   congest     waited               487        190         52          3
Direct number conditional waited                 0          0        200        301
Direct time   conditional waited               0ms        0ms        0ms     1904ms
Direct full   conditional waited                 0          0          0         19
KSwapd number congest     waited                 0          2         23          4
KSwapd time   congest     waited               0ms      200ms      420ms      404ms
KSwapd full   congest     waited                 0          2          2          4
KSwapd number conditional waited                 0          0          0          0
KSwapd time   conditional waited               0ms        0ms        0ms        0ms
KSwapd full   conditional waited                 0          0          0          0
Not as dramatic a story here but the time spent asleep is reduced and we
can still see what wait_iff_congested is going to sleep when necessary.
MMTests Statistics: duration
User/Sys Time Running Test (seconds)      12028.09   3157.17   3357.79   3199.16
Total Elapsed Time (seconds)              10842.07   3138.72   3705.54   3229.85
The time to complete this test goes way down.  With the full series, we
are allocating over twice the number of huge pages in 30% of the time and
there is a corresponding impact on the allocation latency graph available
at.
http://www.csn.ul.ie/~mel/postings/vmscanreduce-20101509/highalloc-interlatency-powyah-mean.ps
This patch:
Add a trace event for shrink_inactive_list() and updates the sample
postprocessing script appropriately.  It can be used to determine how many
pages were reclaimed and for non-lumpy reclaim where exactly the pages
were reclaimed from.
Signed-off-by: Mel Gorman <mel@csn.ul.ie>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Minchan Kim <minchan.kim@gmail.com>
Cc: Wu Fengguang <fengguang.wu@intel.com>
Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Cc: Rik van Riel <riel@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
											
										 
											2010-10-26 14:21:40 -07:00
										 |  |  | 	trace_mm_vmscan_lru_shrink_inactive(zone->zone_pgdat->node_id, | 
					
						
							|  |  |  | 		zone_idx(zone), | 
					
						
							|  |  |  | 		nr_scanned, nr_reclaimed, | 
					
						
							| 
									
										
										
										
											2012-05-29 15:06:57 -07:00
										 |  |  | 		sc->priority, | 
					
						
							| 
									
										
										
										
											2012-05-29 15:06:20 -07:00
										 |  |  | 		trace_shrink_flags(file)); | 
					
						
							| 
									
										
										
										
											2006-03-22 00:08:20 -08:00
										 |  |  | 	return nr_reclaimed; | 
					
						
							| 
									
										
										
										
											2005-04-16 15:20:36 -07:00
										 |  |  | } | 
					
						
							|  |  |  | 
 | 
					
						
							|  |  |  | /*
 | 
					
						
							|  |  |  |  * This moves pages from the active list to the inactive list. | 
					
						
							|  |  |  |  * | 
					
						
							|  |  |  |  * We move them the other way if the page is referenced by one or more | 
					
						
							|  |  |  |  * processes, from rmap. | 
					
						
							|  |  |  |  * | 
					
						
							|  |  |  |  * If the pages are mostly unmapped, the processing is fast and it is | 
					
						
							|  |  |  |  * appropriate to hold zone->lru_lock across the whole operation.  But if | 
					
						
							|  |  |  |  * the pages are mapped, the processing is slow (page_referenced()) so we | 
					
						
							|  |  |  |  * should drop zone->lru_lock around each page.  It's impossible to balance | 
					
						
							|  |  |  |  * this, so instead we remove the pages from the LRU while processing them. | 
					
						
							|  |  |  |  * It is safe to rely on PG_active against the non-LRU pages in here because | 
					
						
							|  |  |  |  * nobody will play with that bit on a non-LRU page. | 
					
						
							|  |  |  |  * | 
					
						
							|  |  |  |  * The downside is that we have to touch page->_count against each page. | 
					
						
							|  |  |  |  * But we had to alter page->flags anyway. | 
					
						
							|  |  |  |  */ | 
					
						
							| 
									
										
										
										
											2008-02-07 00:14:37 -08:00
										 |  |  | 
 | 
					
						
							| 
									
										
										
										
											2012-05-29 15:07:09 -07:00
										 |  |  | static void move_active_pages_to_lru(struct lruvec *lruvec, | 
					
						
							| 
									
										
										
										
											2009-06-16 15:33:13 -07:00
										 |  |  | 				     struct list_head *list, | 
					
						
							| 
									
										
										
										
											2012-01-12 17:19:56 -08:00
										 |  |  | 				     struct list_head *pages_to_free, | 
					
						
							| 
									
										
										
										
											2009-06-16 15:33:13 -07:00
										 |  |  | 				     enum lru_list lru) | 
					
						
							|  |  |  | { | 
					
						
							| 
									
										
										
										
											2012-05-29 15:07:09 -07:00
										 |  |  | 	struct zone *zone = lruvec_zone(lruvec); | 
					
						
							| 
									
										
										
										
											2009-06-16 15:33:13 -07:00
										 |  |  | 	unsigned long pgmoved = 0; | 
					
						
							|  |  |  | 	struct page *page; | 
					
						
							| 
									
										
										
										
											2012-05-29 15:07:09 -07:00
										 |  |  | 	int nr_pages; | 
					
						
							| 
									
										
										
										
											2009-06-16 15:33:13 -07:00
										 |  |  | 
 | 
					
						
							|  |  |  | 	while (!list_empty(list)) { | 
					
						
							|  |  |  | 		page = lru_to_page(list); | 
					
						
							| 
									
										
										
										
											2012-05-29 15:07:09 -07:00
										 |  |  | 		lruvec = mem_cgroup_page_lruvec(page, zone); | 
					
						
							| 
									
										
										
										
											2009-06-16 15:33:13 -07:00
										 |  |  | 
 | 
					
						
							|  |  |  | 		VM_BUG_ON(PageLRU(page)); | 
					
						
							|  |  |  | 		SetPageLRU(page); | 
					
						
							|  |  |  | 
 | 
					
						
							| 
									
										
										
										
											2012-05-29 15:07:09 -07:00
										 |  |  | 		nr_pages = hpage_nr_pages(page); | 
					
						
							|  |  |  | 		mem_cgroup_update_lru_size(lruvec, lru, nr_pages); | 
					
						
							| 
									
										
										
										
											2012-01-12 17:18:15 -08:00
										 |  |  | 		list_move(&page->lru, &lruvec->lists[lru]); | 
					
						
							| 
									
										
										
										
											2012-05-29 15:07:09 -07:00
										 |  |  | 		pgmoved += nr_pages; | 
					
						
							| 
									
										
										
										
											2009-06-16 15:33:13 -07:00
										 |  |  | 
 | 
					
						
							| 
									
										
										
										
											2012-01-12 17:19:56 -08:00
										 |  |  | 		if (put_page_testzero(page)) { | 
					
						
							|  |  |  | 			__ClearPageLRU(page); | 
					
						
							|  |  |  | 			__ClearPageActive(page); | 
					
						
							| 
									
										
										
										
											2012-05-29 15:07:09 -07:00
										 |  |  | 			del_page_from_lru_list(page, lruvec, lru); | 
					
						
							| 
									
										
										
										
											2012-01-12 17:19:56 -08:00
										 |  |  | 
 | 
					
						
							|  |  |  | 			if (unlikely(PageCompound(page))) { | 
					
						
							|  |  |  | 				spin_unlock_irq(&zone->lru_lock); | 
					
						
							|  |  |  | 				(*get_compound_page_dtor(page))(page); | 
					
						
							|  |  |  | 				spin_lock_irq(&zone->lru_lock); | 
					
						
							|  |  |  | 			} else | 
					
						
							|  |  |  | 				list_add(&page->lru, pages_to_free); | 
					
						
							| 
									
										
										
										
											2009-06-16 15:33:13 -07:00
										 |  |  | 		} | 
					
						
							|  |  |  | 	} | 
					
						
							|  |  |  | 	__mod_zone_page_state(zone, NR_LRU_BASE + lru, pgmoved); | 
					
						
							|  |  |  | 	if (!is_active_lru(lru)) | 
					
						
							|  |  |  | 		__count_vm_events(PGDEACTIVATE, pgmoved); | 
					
						
							|  |  |  | } | 
					
						
							| 
									
										
										
										
											2008-02-07 00:14:37 -08:00
										 |  |  | 
 | 
					
						
							| 
									
										
										
										
											2012-01-12 17:20:06 -08:00
										 |  |  | static void shrink_active_list(unsigned long nr_to_scan, | 
					
						
							| 
									
										
										
										
											2012-05-29 15:07:01 -07:00
										 |  |  | 			       struct lruvec *lruvec, | 
					
						
							| 
									
										
										
										
											2012-01-12 17:17:52 -08:00
										 |  |  | 			       struct scan_control *sc, | 
					
						
							| 
									
										
										
										
											2012-05-29 15:06:57 -07:00
										 |  |  | 			       enum lru_list lru) | 
					
						
							| 
									
										
										
										
											2005-04-16 15:20:36 -07:00
										 |  |  | { | 
					
						
							| 
									
										
										
										
											2009-09-21 17:01:35 -07:00
										 |  |  | 	unsigned long nr_taken; | 
					
						
							| 
									
										
										
										
											2012-01-12 17:20:06 -08:00
										 |  |  | 	unsigned long nr_scanned; | 
					
						
							| 
									
										
										
										
											2009-06-16 15:33:05 -07:00
										 |  |  | 	unsigned long vm_flags; | 
					
						
							| 
									
										
										
										
											2005-04-16 15:20:36 -07:00
										 |  |  | 	LIST_HEAD(l_hold);	/* The pages which were snipped off */ | 
					
						
							| 
									
										
											  
											
												vmscan: make mapped executable pages the first class citizen
Protect referenced PROT_EXEC mapped pages from being deactivated.
PROT_EXEC(or its internal presentation VM_EXEC) pages normally belong to some
currently running executables and their linked libraries, they shall really be
cached aggressively to provide good user experiences.
Thanks to Johannes Weiner for the advice to reuse the VMA walk in
page_referenced() to get the PROT_EXEC bit.
[more details]
( The consequences of this patch will have to be discussed together with
  Rik van Riel's recent patch "vmscan: evict use-once pages first". )
( Some of the good points and insights are taken into this changelog.
  Thanks to all the involved people for the great LKML discussions. )
the problem
===========
For a typical desktop, the most precious working set is composed of
*actively accessed*
	(1) memory mapped executables
	(2) and their anonymous pages
	(3) and other files
	(4) and the dcache/icache/.. slabs
while the least important data are
	(5) infrequently used or use-once files
For a typical desktop, one major problem is busty and large amount of (5)
use-once files flushing out the working set.
Inside the working set, (4) dcache/icache have already been too sticky ;-)
So we only have to care (2) anonymous and (1)(3) file pages.
anonymous pages
===============
Anonymous pages are effectively immune to the streaming IO attack, because we
now have separate file/anon LRU lists. When the use-once files crowd into the
file LRU, the list's "quality" is significantly lowered. Therefore the scan
balance policy in get_scan_ratio() will choose to scan the (low quality) file
LRU much more frequently than the anon LRU.
file pages
==========
Rik proposed to *not* scan the active file LRU when the inactive list grows
larger than active list. This guarantees that when there are use-once streaming
IO, and the working set is not too large(so that active_size < inactive_size),
the active file LRU will *not* be scanned at all. So the not-too-large working
set can be well protected.
But there are also situations where the file working set is a bit large so that
(active_size >= inactive_size), or the streaming IOs are not purely use-once.
In these cases, the active list will be scanned slowly. Because the current
shrink_active_list() policy is to deactivate active pages regardless of their
referenced bits. The deactivated pages become susceptible to the streaming IO
attack: the inactive list could be scanned fast (500MB / 50MBps = 10s) so that
the deactivated pages don't have enough time to get re-referenced. Because a
user tend to switch between windows in intervals from seconds to minutes.
This patch holds mapped executable pages in the active list as long as they
are referenced during each full scan of the active list.  Because the active
list is normally scanned much slower, they get longer grace time (eg. 100s)
for further references, which better matches the pace of user operations.
Therefore this patch greatly prolongs the in-cache time of executable code,
when there are moderate memory pressures.
	before patch: guaranteed to be cached if reference intervals < I
	after  patch: guaranteed to be cached if reference intervals < I+A
		      (except when randomly reclaimed by the lumpy reclaim)
where
	A = time to fully scan the   active file LRU
	I = time to fully scan the inactive file LRU
Note that normally A >> I.
side effects
============
This patch is safe in general, it restores the pre-2.6.28 mmap() behavior
but in a much smaller and well targeted scope.
One may worry about some one to abuse the PROT_EXEC heuristic.  But as
Andrew Morton stated, there are other tricks to getting that sort of boost.
Another concern is the PROT_EXEC mapped pages growing large in rare cases,
and therefore hurting reclaim efficiency. But a sane application targeted for
large audience will never use PROT_EXEC for data mappings. If some home made
application tries to abuse that bit, it shall be aware of the consequences.
If it is abused to scale of 2/3 total memory, it gains nothing but overheads.
benchmarks
==========
1) memory tight desktop
1.1) brief summary
- clock time and major faults are reduced by 50%;
- pswpin numbers are reduced to ~1/3.
That means X desktop responsiveness is doubled under high memory/swap pressure.
1.2) test scenario
- nfsroot gnome desktop with 512M physical memory
- run some programs, and switch between the existing windows
  after starting each new program.
1.3) progress timing (seconds)
  before       after    programs
    0.02        0.02    N xeyes
    0.75        0.76    N firefox
    2.02        1.88    N nautilus
    3.36        3.17    N nautilus --browser
    5.26        4.89    N gthumb
    7.12        6.47    N gedit
    9.22        8.16    N xpdf /usr/share/doc/shared-mime-info/shared-mime-info-spec.pdf
   13.58       12.55    N xterm
   15.87       14.57    N mlterm
   18.63       17.06    N gnome-terminal
   21.16       18.90    N urxvt
   26.24       23.48    N gnome-system-monitor
   28.72       26.52    N gnome-help
   32.15       29.65    N gnome-dictionary
   39.66       36.12    N /usr/games/sol
   43.16       39.27    N /usr/games/gnometris
   48.65       42.56    N /usr/games/gnect
   53.31       47.03    N /usr/games/gtali
   58.60       52.05    N /usr/games/iagno
   65.77       55.42    N /usr/games/gnotravex
   70.76       61.47    N /usr/games/mahjongg
   76.15       67.11    N /usr/games/gnome-sudoku
   86.32       75.15    N /usr/games/glines
   92.21       79.70    N /usr/games/glchess
  103.79       88.48    N /usr/games/gnomine
  113.84       96.51    N /usr/games/gnotski
  124.40      102.19    N /usr/games/gnibbles
  137.41      114.93    N /usr/games/gnobots2
  155.53      125.02    N /usr/games/blackjack
  179.85      135.11    N /usr/games/same-gnome
  224.49      154.50    N /usr/bin/gnome-window-properties
  248.44      162.09    N /usr/bin/gnome-default-applications-properties
  282.62      173.29    N /usr/bin/gnome-at-properties
  323.72      188.21    N /usr/bin/gnome-typing-monitor
  363.99      199.93    N /usr/bin/gnome-at-visual
  394.21      206.95    N /usr/bin/gnome-sound-properties
  435.14      224.49    N /usr/bin/gnome-at-mobility
  463.05      234.11    N /usr/bin/gnome-keybinding-properties
  503.75      248.59    N /usr/bin/gnome-about-me
  554.00      276.27    N /usr/bin/gnome-display-properties
  615.48      304.39    N /usr/bin/gnome-network-preferences
  693.03      342.01    N /usr/bin/gnome-mouse-properties
  759.90      388.58    N /usr/bin/gnome-appearance-properties
  937.90      508.47    N /usr/bin/gnome-control-center
 1109.75      587.57    N /usr/bin/gnome-keyboard-properties
 1399.05      758.16    N : oocalc
 1524.64      830.03    N : oodraw
 1684.31      900.03    N : ooimpress
 1874.04      993.91    N : oomath
 2115.12     1081.89    N : ooweb
 2369.02     1161.99    N : oowriter
Note that the last ": oo*" commands are actually commented out.
1.4) vmstat numbers (some relevant ones are marked with *)
                            before    after
 nr_free_pages              1293      3898
 nr_inactive_anon           59956     53460
 nr_active_anon             26815     30026
 nr_inactive_file           2657      3218
 nr_active_file             2019      2806
 nr_unevictable             4         4
 nr_mlock                   4         4
 nr_anon_pages              26706     27859
*nr_mapped                  3542      4469
 nr_file_pages              72232     67681
 nr_dirty                   1         0
 nr_writeback               123       19
 nr_slab_reclaimable        3375      3534
 nr_slab_unreclaimable      11405     10665
 nr_page_table_pages        8106      7864
 nr_unstable                0         0
 nr_bounce                  0         0
*nr_vmscan_write            394776    230839
 nr_writeback_temp          0         0
 numa_hit                   6843353   3318676
 numa_miss                  0         0
 numa_foreign               0         0
 numa_interleave            1719      1719
 numa_local                 6843353   3318676
 numa_other                 0         0
*pgpgin                     5954683   2057175
*pgpgout                    1578276   922744
*pswpin                     1486615   512238
*pswpout                    394568    230685
 pgalloc_dma                277432    56602
 pgalloc_dma32              6769477   3310348
 pgalloc_normal             0         0
 pgalloc_movable            0         0
 pgfree                     7048396   3371118
 pgactivate                 2036343   1471492
 pgdeactivate               2189691   1612829
 pgfault                    3702176   3100702
*pgmajfault                 452116    201343
 pgrefill_dma               12185     7127
 pgrefill_dma32             334384    653703
 pgrefill_normal            0         0
 pgrefill_movable           0         0
 pgsteal_dma                74214     22179
 pgsteal_dma32              3334164   1638029
 pgsteal_normal             0         0
 pgsteal_movable            0         0
 pgscan_kswapd_dma          1081421   1216199
 pgscan_kswapd_dma32        58979118  46002810
 pgscan_kswapd_normal       0         0
 pgscan_kswapd_movable      0         0
 pgscan_direct_dma          2015438   1086109
 pgscan_direct_dma32        55787823  36101597
 pgscan_direct_normal       0         0
 pgscan_direct_movable      0         0
 pginodesteal               3461      7281
 slabs_scanned              564864    527616
 kswapd_steal               2889797   1448082
 kswapd_inodesteal          14827     14835
 pageoutrun                 43459     21562
 allocstall                 9653      4032
 pgrotated                  384216    228631
1.5) free numbers at the end of the tests
before patch:
                             total       used       free     shared    buffers     cached
                Mem:           474        467          7          0          0        236
                -/+ buffers/cache:        230        243
                Swap:         1023        418        605
after patch:
                             total       used       free     shared    buffers     cached
                Mem:           474        457         16          0          0        236
                -/+ buffers/cache:        221        253
                Swap:         1023        404        619
2) memory flushing in a file server
2.1) brief summary
The number of major faults from 50 to 3 during 10% cache hot reads.
That means this patch successfully stops major faults when the active file
list is slowly scanned when there are partially cache hot streaming IO.
2.2) test scenario
Do 100000 pread(size=110 pages, offset=(i*100) pages), where 10% of the
pages will be activated:
        for i in `seq 0 100 10000000`; do echo $i 110;  done > pattern-hot-10
        iotrace.rb --load pattern-hot-10 --play /b/sparse
	vmmon  nr_mapped nr_active_file nr_inactive_file   pgmajfault pgdeactivate pgfree
and monitor /proc/vmstat during the time. The test box has 2G memory.
I carried out tests on fresh booted console as well as X desktop, and
fetched the vmstat numbers on
(1) begin:     shortly after the big read IO starts;
(2) end:       just before the big read IO stops;
(3) restore:   the big read IO stops and the zsh working set restored
(4) restore X: after IO, switch back and forth between the urxvt and firefox
               windows to restore their working set.
2.3) console mode results
        nr_mapped   nr_active_file nr_inactive_file       pgmajfault     pgdeactivate           pgfree
2.6.29 VM_EXEC protection ON:
begin:       2481             2237             8694              630                0           574299
end:          275           231976           233914              633           776271         20933042
restore:      370           232154           234524              691           777183         20958453
2.6.29 VM_EXEC protection ON (second run):
begin:       2434             2237             8493              629                0           574195
end:          284           231970           233536              632           771918         20896129
restore:      399           232218           234789              690           774526         20957909
2.6.30-rc4-mm VM_EXEC protection OFF:
begin:       2479             2344             9659              210                0           579643
end:          284           232010           234142              260           772776         20917184
restore:      379           232159           234371              301           774888         20967849
The above console numbers show that
- The startup pgmajfault of 2.6.30-rc4-mm is merely 1/3 that of 2.6.29.
  I'd attribute that improvement to the mmap readahead improvements :-)
- The pgmajfault increment during the file copy is 633-630=3 vs 260-210=50.
  That's a huge improvement - which means with the VM_EXEC protection logic,
  active mmap pages is pretty safe even under partially cache hot streaming IO.
- when active:inactive file lru size reaches 1:1, their scan rates is 1:20.8
  under 10% cache hot IO. (computed with formula Dpgdeactivate:Dpgfree)
  That roughly means the active mmap pages get 20.8 more chances to get
  re-referenced to stay in memory.
- The absolute nr_mapped drops considerably to 1/9 during the big IO, and the
  dropped pages are mostly inactive ones. The patch has almost no impact in
  this aspect, that means it won't unnecessarily increase memory pressure.
  (In contrast, your 20% mmap protection ratio will keep them all, and
  therefore eliminate the extra 41 major faults to restore working set
  of zsh etc.)
The iotrace.rb read throughput is
	151.194384MB/s 284.198252s 100001x 450560b --load pattern-hot-10 --play /b/sparse
which means the inactive list is rotated at the speed of 250MB/s,
so a full scan of which takes about 3.5 seconds, while a full scan
of active file list takes about 77 seconds.
2.4) X mode results
We can reach roughly the same conclusions for X desktop:
        nr_mapped   nr_active_file nr_inactive_file       pgmajfault     pgdeactivate           pgfree
2.6.30-rc4-mm VM_EXEC protection ON:
begin:       9740             8920            64075              561                0           678360
end:          768           218254           220029              565           798953         21057006
restore:      857           218543           220987              606           799462         21075710
restore X:   2414           218560           225344              797           799462         21080795
2.6.30-rc4-mm VM_EXEC protection OFF:
begin:       9368             5035            26389              554                0           633391
end:          770           218449           221230              661           646472         17832500
restore:     1113           218466           220978              710           649881         17905235
restore X:   2687           218650           225484              947           802700         21083584
- the absolute nr_mapped drops considerably (to 1/13 of the original size)
  during the streaming IO.
- the delta of pgmajfault is 3 vs 107 during IO, or 236 vs 393
  during the whole process.
Cc: Elladan <elladan@eskimo.com>
Cc: Nick Piggin <npiggin@suse.de>
Cc: Andi Kleen <andi@firstfloor.org>
Cc: Christoph Lameter <cl@linux-foundation.org>
Acked-by: Rik van Riel <riel@redhat.com>
Acked-by: Peter Zijlstra <peterz@infradead.org>
Acked-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Reviewed-by: Johannes Weiner <hannes@cmpxchg.org>
Reviewed-by: Minchan Kim <minchan.kim@gmail.com>
Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
											
										 
											2009-06-16 15:33:12 -07:00
										 |  |  | 	LIST_HEAD(l_active); | 
					
						
							| 
									
										
										
										
											2008-10-18 20:26:14 -07:00
										 |  |  | 	LIST_HEAD(l_inactive); | 
					
						
							| 
									
										
										
										
											2005-04-16 15:20:36 -07:00
										 |  |  | 	struct page *page; | 
					
						
							| 
									
										
										
										
											2012-05-29 15:07:01 -07:00
										 |  |  | 	struct zone_reclaim_stat *reclaim_stat = &lruvec->reclaim_stat; | 
					
						
							| 
									
										
										
										
											2009-09-21 17:01:35 -07:00
										 |  |  | 	unsigned long nr_rotated = 0; | 
					
						
							| 
									
										
										
										
											2012-05-29 15:06:54 -07:00
										 |  |  | 	isolate_mode_t isolate_mode = 0; | 
					
						
							| 
									
										
										
										
											2012-05-29 15:06:53 -07:00
										 |  |  | 	int file = is_file_lru(lru); | 
					
						
							| 
									
										
										
										
											2012-05-29 15:07:01 -07:00
										 |  |  | 	struct zone *zone = lruvec_zone(lruvec); | 
					
						
							| 
									
										
										
										
											2005-04-16 15:20:36 -07:00
										 |  |  | 
 | 
					
						
							|  |  |  | 	lru_add_drain(); | 
					
						
							| 
									
										
										
										
											2011-10-31 17:06:55 -07:00
										 |  |  | 
 | 
					
						
							|  |  |  | 	if (!sc->may_unmap) | 
					
						
							| 
									
										
										
										
											2012-03-21 16:33:48 -07:00
										 |  |  | 		isolate_mode |= ISOLATE_UNMAPPED; | 
					
						
							| 
									
										
										
										
											2011-10-31 17:06:55 -07:00
										 |  |  | 	if (!sc->may_writepage) | 
					
						
							| 
									
										
										
										
											2012-03-21 16:33:48 -07:00
										 |  |  | 		isolate_mode |= ISOLATE_CLEAN; | 
					
						
							| 
									
										
										
										
											2011-10-31 17:06:55 -07:00
										 |  |  | 
 | 
					
						
							| 
									
										
										
										
											2005-04-16 15:20:36 -07:00
										 |  |  | 	spin_lock_irq(&zone->lru_lock); | 
					
						
							| 
									
										
										
										
											2012-01-12 17:18:15 -08:00
										 |  |  | 
 | 
					
						
							| 
									
										
										
										
											2012-05-29 15:06:58 -07:00
										 |  |  | 	nr_taken = isolate_lru_pages(nr_to_scan, lruvec, &l_hold, | 
					
						
							|  |  |  | 				     &nr_scanned, sc, isolate_mode, lru); | 
					
						
							| 
									
										
										
										
											2012-01-12 17:17:50 -08:00
										 |  |  | 	if (global_reclaim(sc)) | 
					
						
							| 
									
										
										
										
											2012-01-12 17:20:06 -08:00
										 |  |  | 		zone->pages_scanned += nr_scanned; | 
					
						
							| 
									
										
										
										
											2012-01-12 17:17:50 -08:00
										 |  |  | 
 | 
					
						
							| 
									
										
										
										
											2009-09-21 17:02:56 -07:00
										 |  |  | 	reclaim_stat->recent_scanned[file] += nr_taken; | 
					
						
							| 
									
										
										
										
											2008-02-07 00:14:37 -08:00
										 |  |  | 
 | 
					
						
							| 
									
										
										
										
											2012-01-12 17:20:06 -08:00
										 |  |  | 	__count_zone_vm_events(PGREFILL, zone, nr_scanned); | 
					
						
							| 
									
										
										
										
											2012-05-29 15:06:53 -07:00
										 |  |  | 	__mod_zone_page_state(zone, NR_LRU_BASE + lru, -nr_taken); | 
					
						
							| 
									
										
										
										
											2009-09-21 17:01:37 -07:00
										 |  |  | 	__mod_zone_page_state(zone, NR_ISOLATED_ANON + file, nr_taken); | 
					
						
							| 
									
										
										
										
											2005-04-16 15:20:36 -07:00
										 |  |  | 	spin_unlock_irq(&zone->lru_lock); | 
					
						
							|  |  |  | 
 | 
					
						
							|  |  |  | 	while (!list_empty(&l_hold)) { | 
					
						
							|  |  |  | 		cond_resched(); | 
					
						
							|  |  |  | 		page = lru_to_page(&l_hold); | 
					
						
							|  |  |  | 		list_del(&page->lru); | 
					
						
							| 
									
										
										
										
											2008-10-18 20:26:35 -07:00
										 |  |  | 
 | 
					
						
							| 
									
										
										
										
											2012-10-08 16:33:18 -07:00
										 |  |  | 		if (unlikely(!page_evictable(page))) { | 
					
						
							| 
									
										
											  
											
												Unevictable LRU Infrastructure
When the system contains lots of mlocked or otherwise unevictable pages,
the pageout code (kswapd) can spend lots of time scanning over these
pages.  Worse still, the presence of lots of unevictable pages can confuse
kswapd into thinking that more aggressive pageout modes are required,
resulting in all kinds of bad behaviour.
Infrastructure to manage pages excluded from reclaim--i.e., hidden from
vmscan.  Based on a patch by Larry Woodman of Red Hat.  Reworked to
maintain "unevictable" pages on a separate per-zone LRU list, to "hide"
them from vmscan.
Kosaki Motohiro added the support for the memory controller unevictable
lru list.
Pages on the unevictable list have both PG_unevictable and PG_lru set.
Thus, PG_unevictable is analogous to and mutually exclusive with
PG_active--it specifies which LRU list the page is on.
The unevictable infrastructure is enabled by a new mm Kconfig option
[CONFIG_]UNEVICTABLE_LRU.
A new function 'page_evictable(page, vma)' in vmscan.c tests whether or
not a page may be evictable.  Subsequent patches will add the various
!evictable tests.  We'll want to keep these tests light-weight for use in
shrink_active_list() and, possibly, the fault path.
To avoid races between tasks putting pages [back] onto an LRU list and
tasks that might be moving the page from non-evictable to evictable state,
the new function 'putback_lru_page()' -- inverse to 'isolate_lru_page()'
-- tests the "evictability" of a page after placing it on the LRU, before
dropping the reference.  If the page has become unevictable,
putback_lru_page() will redo the 'putback', thus moving the page to the
unevictable list.  This way, we avoid "stranding" evictable pages on the
unevictable list.
[akpm@linux-foundation.org: fix fallout from out-of-order merge]
[riel@redhat.com: fix UNEVICTABLE_LRU and !PROC_PAGE_MONITOR build]
[nishimura@mxp.nes.nec.co.jp: remove redundant mapping check]
[kosaki.motohiro@jp.fujitsu.com: unevictable-lru-infrastructure: putback_lru_page()/unevictable page handling rework]
[kosaki.motohiro@jp.fujitsu.com: kill unnecessary lock_page() in vmscan.c]
[kosaki.motohiro@jp.fujitsu.com: revert migration change of unevictable lru infrastructure]
[kosaki.motohiro@jp.fujitsu.com: revert to unevictable-lru-infrastructure-kconfig-fix.patch]
[kosaki.motohiro@jp.fujitsu.com: restore patch failure of vmstat-unevictable-and-mlocked-pages-vm-events.patch]
Signed-off-by: Lee Schermerhorn <lee.schermerhorn@hp.com>
Signed-off-by: Rik van Riel <riel@redhat.com>
Signed-off-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Debugged-by: Benjamin Kidwell <benjkidwell@yahoo.com>
Signed-off-by: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp>
Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
											
										 
											2008-10-18 20:26:39 -07:00
										 |  |  | 			putback_lru_page(page); | 
					
						
							|  |  |  | 			continue; | 
					
						
							|  |  |  | 		} | 
					
						
							|  |  |  | 
 | 
					
						
							| 
									
										
										
										
											2012-03-21 16:34:00 -07:00
										 |  |  | 		if (unlikely(buffer_heads_over_limit)) { | 
					
						
							|  |  |  | 			if (page_has_private(page) && trylock_page(page)) { | 
					
						
							|  |  |  | 				if (page_has_private(page)) | 
					
						
							|  |  |  | 					try_to_release_page(page, 0); | 
					
						
							|  |  |  | 				unlock_page(page); | 
					
						
							|  |  |  | 			} | 
					
						
							|  |  |  | 		} | 
					
						
							|  |  |  | 
 | 
					
						
							| 
									
										
										
										
											2012-05-29 15:06:25 -07:00
										 |  |  | 		if (page_referenced(page, 0, sc->target_mem_cgroup, | 
					
						
							|  |  |  | 				    &vm_flags)) { | 
					
						
							| 
									
										
										
										
											2011-01-13 15:47:13 -08:00
										 |  |  | 			nr_rotated += hpage_nr_pages(page); | 
					
						
							| 
									
										
											  
											
												vmscan: make mapped executable pages the first class citizen
Protect referenced PROT_EXEC mapped pages from being deactivated.
PROT_EXEC(or its internal presentation VM_EXEC) pages normally belong to some
currently running executables and their linked libraries, they shall really be
cached aggressively to provide good user experiences.
Thanks to Johannes Weiner for the advice to reuse the VMA walk in
page_referenced() to get the PROT_EXEC bit.
[more details]
( The consequences of this patch will have to be discussed together with
  Rik van Riel's recent patch "vmscan: evict use-once pages first". )
( Some of the good points and insights are taken into this changelog.
  Thanks to all the involved people for the great LKML discussions. )
the problem
===========
For a typical desktop, the most precious working set is composed of
*actively accessed*
	(1) memory mapped executables
	(2) and their anonymous pages
	(3) and other files
	(4) and the dcache/icache/.. slabs
while the least important data are
	(5) infrequently used or use-once files
For a typical desktop, one major problem is busty and large amount of (5)
use-once files flushing out the working set.
Inside the working set, (4) dcache/icache have already been too sticky ;-)
So we only have to care (2) anonymous and (1)(3) file pages.
anonymous pages
===============
Anonymous pages are effectively immune to the streaming IO attack, because we
now have separate file/anon LRU lists. When the use-once files crowd into the
file LRU, the list's "quality" is significantly lowered. Therefore the scan
balance policy in get_scan_ratio() will choose to scan the (low quality) file
LRU much more frequently than the anon LRU.
file pages
==========
Rik proposed to *not* scan the active file LRU when the inactive list grows
larger than active list. This guarantees that when there are use-once streaming
IO, and the working set is not too large(so that active_size < inactive_size),
the active file LRU will *not* be scanned at all. So the not-too-large working
set can be well protected.
But there are also situations where the file working set is a bit large so that
(active_size >= inactive_size), or the streaming IOs are not purely use-once.
In these cases, the active list will be scanned slowly. Because the current
shrink_active_list() policy is to deactivate active pages regardless of their
referenced bits. The deactivated pages become susceptible to the streaming IO
attack: the inactive list could be scanned fast (500MB / 50MBps = 10s) so that
the deactivated pages don't have enough time to get re-referenced. Because a
user tend to switch between windows in intervals from seconds to minutes.
This patch holds mapped executable pages in the active list as long as they
are referenced during each full scan of the active list.  Because the active
list is normally scanned much slower, they get longer grace time (eg. 100s)
for further references, which better matches the pace of user operations.
Therefore this patch greatly prolongs the in-cache time of executable code,
when there are moderate memory pressures.
	before patch: guaranteed to be cached if reference intervals < I
	after  patch: guaranteed to be cached if reference intervals < I+A
		      (except when randomly reclaimed by the lumpy reclaim)
where
	A = time to fully scan the   active file LRU
	I = time to fully scan the inactive file LRU
Note that normally A >> I.
side effects
============
This patch is safe in general, it restores the pre-2.6.28 mmap() behavior
but in a much smaller and well targeted scope.
One may worry about some one to abuse the PROT_EXEC heuristic.  But as
Andrew Morton stated, there are other tricks to getting that sort of boost.
Another concern is the PROT_EXEC mapped pages growing large in rare cases,
and therefore hurting reclaim efficiency. But a sane application targeted for
large audience will never use PROT_EXEC for data mappings. If some home made
application tries to abuse that bit, it shall be aware of the consequences.
If it is abused to scale of 2/3 total memory, it gains nothing but overheads.
benchmarks
==========
1) memory tight desktop
1.1) brief summary
- clock time and major faults are reduced by 50%;
- pswpin numbers are reduced to ~1/3.
That means X desktop responsiveness is doubled under high memory/swap pressure.
1.2) test scenario
- nfsroot gnome desktop with 512M physical memory
- run some programs, and switch between the existing windows
  after starting each new program.
1.3) progress timing (seconds)
  before       after    programs
    0.02        0.02    N xeyes
    0.75        0.76    N firefox
    2.02        1.88    N nautilus
    3.36        3.17    N nautilus --browser
    5.26        4.89    N gthumb
    7.12        6.47    N gedit
    9.22        8.16    N xpdf /usr/share/doc/shared-mime-info/shared-mime-info-spec.pdf
   13.58       12.55    N xterm
   15.87       14.57    N mlterm
   18.63       17.06    N gnome-terminal
   21.16       18.90    N urxvt
   26.24       23.48    N gnome-system-monitor
   28.72       26.52    N gnome-help
   32.15       29.65    N gnome-dictionary
   39.66       36.12    N /usr/games/sol
   43.16       39.27    N /usr/games/gnometris
   48.65       42.56    N /usr/games/gnect
   53.31       47.03    N /usr/games/gtali
   58.60       52.05    N /usr/games/iagno
   65.77       55.42    N /usr/games/gnotravex
   70.76       61.47    N /usr/games/mahjongg
   76.15       67.11    N /usr/games/gnome-sudoku
   86.32       75.15    N /usr/games/glines
   92.21       79.70    N /usr/games/glchess
  103.79       88.48    N /usr/games/gnomine
  113.84       96.51    N /usr/games/gnotski
  124.40      102.19    N /usr/games/gnibbles
  137.41      114.93    N /usr/games/gnobots2
  155.53      125.02    N /usr/games/blackjack
  179.85      135.11    N /usr/games/same-gnome
  224.49      154.50    N /usr/bin/gnome-window-properties
  248.44      162.09    N /usr/bin/gnome-default-applications-properties
  282.62      173.29    N /usr/bin/gnome-at-properties
  323.72      188.21    N /usr/bin/gnome-typing-monitor
  363.99      199.93    N /usr/bin/gnome-at-visual
  394.21      206.95    N /usr/bin/gnome-sound-properties
  435.14      224.49    N /usr/bin/gnome-at-mobility
  463.05      234.11    N /usr/bin/gnome-keybinding-properties
  503.75      248.59    N /usr/bin/gnome-about-me
  554.00      276.27    N /usr/bin/gnome-display-properties
  615.48      304.39    N /usr/bin/gnome-network-preferences
  693.03      342.01    N /usr/bin/gnome-mouse-properties
  759.90      388.58    N /usr/bin/gnome-appearance-properties
  937.90      508.47    N /usr/bin/gnome-control-center
 1109.75      587.57    N /usr/bin/gnome-keyboard-properties
 1399.05      758.16    N : oocalc
 1524.64      830.03    N : oodraw
 1684.31      900.03    N : ooimpress
 1874.04      993.91    N : oomath
 2115.12     1081.89    N : ooweb
 2369.02     1161.99    N : oowriter
Note that the last ": oo*" commands are actually commented out.
1.4) vmstat numbers (some relevant ones are marked with *)
                            before    after
 nr_free_pages              1293      3898
 nr_inactive_anon           59956     53460
 nr_active_anon             26815     30026
 nr_inactive_file           2657      3218
 nr_active_file             2019      2806
 nr_unevictable             4         4
 nr_mlock                   4         4
 nr_anon_pages              26706     27859
*nr_mapped                  3542      4469
 nr_file_pages              72232     67681
 nr_dirty                   1         0
 nr_writeback               123       19
 nr_slab_reclaimable        3375      3534
 nr_slab_unreclaimable      11405     10665
 nr_page_table_pages        8106      7864
 nr_unstable                0         0
 nr_bounce                  0         0
*nr_vmscan_write            394776    230839
 nr_writeback_temp          0         0
 numa_hit                   6843353   3318676
 numa_miss                  0         0
 numa_foreign               0         0
 numa_interleave            1719      1719
 numa_local                 6843353   3318676
 numa_other                 0         0
*pgpgin                     5954683   2057175
*pgpgout                    1578276   922744
*pswpin                     1486615   512238
*pswpout                    394568    230685
 pgalloc_dma                277432    56602
 pgalloc_dma32              6769477   3310348
 pgalloc_normal             0         0
 pgalloc_movable            0         0
 pgfree                     7048396   3371118
 pgactivate                 2036343   1471492
 pgdeactivate               2189691   1612829
 pgfault                    3702176   3100702
*pgmajfault                 452116    201343
 pgrefill_dma               12185     7127
 pgrefill_dma32             334384    653703
 pgrefill_normal            0         0
 pgrefill_movable           0         0
 pgsteal_dma                74214     22179
 pgsteal_dma32              3334164   1638029
 pgsteal_normal             0         0
 pgsteal_movable            0         0
 pgscan_kswapd_dma          1081421   1216199
 pgscan_kswapd_dma32        58979118  46002810
 pgscan_kswapd_normal       0         0
 pgscan_kswapd_movable      0         0
 pgscan_direct_dma          2015438   1086109
 pgscan_direct_dma32        55787823  36101597
 pgscan_direct_normal       0         0
 pgscan_direct_movable      0         0
 pginodesteal               3461      7281
 slabs_scanned              564864    527616
 kswapd_steal               2889797   1448082
 kswapd_inodesteal          14827     14835
 pageoutrun                 43459     21562
 allocstall                 9653      4032
 pgrotated                  384216    228631
1.5) free numbers at the end of the tests
before patch:
                             total       used       free     shared    buffers     cached
                Mem:           474        467          7          0          0        236
                -/+ buffers/cache:        230        243
                Swap:         1023        418        605
after patch:
                             total       used       free     shared    buffers     cached
                Mem:           474        457         16          0          0        236
                -/+ buffers/cache:        221        253
                Swap:         1023        404        619
2) memory flushing in a file server
2.1) brief summary
The number of major faults from 50 to 3 during 10% cache hot reads.
That means this patch successfully stops major faults when the active file
list is slowly scanned when there are partially cache hot streaming IO.
2.2) test scenario
Do 100000 pread(size=110 pages, offset=(i*100) pages), where 10% of the
pages will be activated:
        for i in `seq 0 100 10000000`; do echo $i 110;  done > pattern-hot-10
        iotrace.rb --load pattern-hot-10 --play /b/sparse
	vmmon  nr_mapped nr_active_file nr_inactive_file   pgmajfault pgdeactivate pgfree
and monitor /proc/vmstat during the time. The test box has 2G memory.
I carried out tests on fresh booted console as well as X desktop, and
fetched the vmstat numbers on
(1) begin:     shortly after the big read IO starts;
(2) end:       just before the big read IO stops;
(3) restore:   the big read IO stops and the zsh working set restored
(4) restore X: after IO, switch back and forth between the urxvt and firefox
               windows to restore their working set.
2.3) console mode results
        nr_mapped   nr_active_file nr_inactive_file       pgmajfault     pgdeactivate           pgfree
2.6.29 VM_EXEC protection ON:
begin:       2481             2237             8694              630                0           574299
end:          275           231976           233914              633           776271         20933042
restore:      370           232154           234524              691           777183         20958453
2.6.29 VM_EXEC protection ON (second run):
begin:       2434             2237             8493              629                0           574195
end:          284           231970           233536              632           771918         20896129
restore:      399           232218           234789              690           774526         20957909
2.6.30-rc4-mm VM_EXEC protection OFF:
begin:       2479             2344             9659              210                0           579643
end:          284           232010           234142              260           772776         20917184
restore:      379           232159           234371              301           774888         20967849
The above console numbers show that
- The startup pgmajfault of 2.6.30-rc4-mm is merely 1/3 that of 2.6.29.
  I'd attribute that improvement to the mmap readahead improvements :-)
- The pgmajfault increment during the file copy is 633-630=3 vs 260-210=50.
  That's a huge improvement - which means with the VM_EXEC protection logic,
  active mmap pages is pretty safe even under partially cache hot streaming IO.
- when active:inactive file lru size reaches 1:1, their scan rates is 1:20.8
  under 10% cache hot IO. (computed with formula Dpgdeactivate:Dpgfree)
  That roughly means the active mmap pages get 20.8 more chances to get
  re-referenced to stay in memory.
- The absolute nr_mapped drops considerably to 1/9 during the big IO, and the
  dropped pages are mostly inactive ones. The patch has almost no impact in
  this aspect, that means it won't unnecessarily increase memory pressure.
  (In contrast, your 20% mmap protection ratio will keep them all, and
  therefore eliminate the extra 41 major faults to restore working set
  of zsh etc.)
The iotrace.rb read throughput is
	151.194384MB/s 284.198252s 100001x 450560b --load pattern-hot-10 --play /b/sparse
which means the inactive list is rotated at the speed of 250MB/s,
so a full scan of which takes about 3.5 seconds, while a full scan
of active file list takes about 77 seconds.
2.4) X mode results
We can reach roughly the same conclusions for X desktop:
        nr_mapped   nr_active_file nr_inactive_file       pgmajfault     pgdeactivate           pgfree
2.6.30-rc4-mm VM_EXEC protection ON:
begin:       9740             8920            64075              561                0           678360
end:          768           218254           220029              565           798953         21057006
restore:      857           218543           220987              606           799462         21075710
restore X:   2414           218560           225344              797           799462         21080795
2.6.30-rc4-mm VM_EXEC protection OFF:
begin:       9368             5035            26389              554                0           633391
end:          770           218449           221230              661           646472         17832500
restore:     1113           218466           220978              710           649881         17905235
restore X:   2687           218650           225484              947           802700         21083584
- the absolute nr_mapped drops considerably (to 1/13 of the original size)
  during the streaming IO.
- the delta of pgmajfault is 3 vs 107 during IO, or 236 vs 393
  during the whole process.
Cc: Elladan <elladan@eskimo.com>
Cc: Nick Piggin <npiggin@suse.de>
Cc: Andi Kleen <andi@firstfloor.org>
Cc: Christoph Lameter <cl@linux-foundation.org>
Acked-by: Rik van Riel <riel@redhat.com>
Acked-by: Peter Zijlstra <peterz@infradead.org>
Acked-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Reviewed-by: Johannes Weiner <hannes@cmpxchg.org>
Reviewed-by: Minchan Kim <minchan.kim@gmail.com>
Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
											
										 
											2009-06-16 15:33:12 -07:00
										 |  |  | 			/*
 | 
					
						
							|  |  |  | 			 * Identify referenced, file-backed active pages and | 
					
						
							|  |  |  | 			 * give them one more trip around the active list. So | 
					
						
							|  |  |  | 			 * that executable code get better chances to stay in | 
					
						
							|  |  |  | 			 * memory under moderate memory pressure.  Anon pages | 
					
						
							|  |  |  | 			 * are not likely to be evicted by use-once streaming | 
					
						
							|  |  |  | 			 * IO, plus JVM can create lots of anon VM_EXEC pages, | 
					
						
							|  |  |  | 			 * so we ignore them here. | 
					
						
							|  |  |  | 			 */ | 
					
						
							| 
									
										
										
										
											2009-10-26 16:49:53 -07:00
										 |  |  | 			if ((vm_flags & VM_EXEC) && page_is_file_cache(page)) { | 
					
						
							| 
									
										
											  
											
												vmscan: make mapped executable pages the first class citizen
Protect referenced PROT_EXEC mapped pages from being deactivated.
PROT_EXEC(or its internal presentation VM_EXEC) pages normally belong to some
currently running executables and their linked libraries, they shall really be
cached aggressively to provide good user experiences.
Thanks to Johannes Weiner for the advice to reuse the VMA walk in
page_referenced() to get the PROT_EXEC bit.
[more details]
( The consequences of this patch will have to be discussed together with
  Rik van Riel's recent patch "vmscan: evict use-once pages first". )
( Some of the good points and insights are taken into this changelog.
  Thanks to all the involved people for the great LKML discussions. )
the problem
===========
For a typical desktop, the most precious working set is composed of
*actively accessed*
	(1) memory mapped executables
	(2) and their anonymous pages
	(3) and other files
	(4) and the dcache/icache/.. slabs
while the least important data are
	(5) infrequently used or use-once files
For a typical desktop, one major problem is busty and large amount of (5)
use-once files flushing out the working set.
Inside the working set, (4) dcache/icache have already been too sticky ;-)
So we only have to care (2) anonymous and (1)(3) file pages.
anonymous pages
===============
Anonymous pages are effectively immune to the streaming IO attack, because we
now have separate file/anon LRU lists. When the use-once files crowd into the
file LRU, the list's "quality" is significantly lowered. Therefore the scan
balance policy in get_scan_ratio() will choose to scan the (low quality) file
LRU much more frequently than the anon LRU.
file pages
==========
Rik proposed to *not* scan the active file LRU when the inactive list grows
larger than active list. This guarantees that when there are use-once streaming
IO, and the working set is not too large(so that active_size < inactive_size),
the active file LRU will *not* be scanned at all. So the not-too-large working
set can be well protected.
But there are also situations where the file working set is a bit large so that
(active_size >= inactive_size), or the streaming IOs are not purely use-once.
In these cases, the active list will be scanned slowly. Because the current
shrink_active_list() policy is to deactivate active pages regardless of their
referenced bits. The deactivated pages become susceptible to the streaming IO
attack: the inactive list could be scanned fast (500MB / 50MBps = 10s) so that
the deactivated pages don't have enough time to get re-referenced. Because a
user tend to switch between windows in intervals from seconds to minutes.
This patch holds mapped executable pages in the active list as long as they
are referenced during each full scan of the active list.  Because the active
list is normally scanned much slower, they get longer grace time (eg. 100s)
for further references, which better matches the pace of user operations.
Therefore this patch greatly prolongs the in-cache time of executable code,
when there are moderate memory pressures.
	before patch: guaranteed to be cached if reference intervals < I
	after  patch: guaranteed to be cached if reference intervals < I+A
		      (except when randomly reclaimed by the lumpy reclaim)
where
	A = time to fully scan the   active file LRU
	I = time to fully scan the inactive file LRU
Note that normally A >> I.
side effects
============
This patch is safe in general, it restores the pre-2.6.28 mmap() behavior
but in a much smaller and well targeted scope.
One may worry about some one to abuse the PROT_EXEC heuristic.  But as
Andrew Morton stated, there are other tricks to getting that sort of boost.
Another concern is the PROT_EXEC mapped pages growing large in rare cases,
and therefore hurting reclaim efficiency. But a sane application targeted for
large audience will never use PROT_EXEC for data mappings. If some home made
application tries to abuse that bit, it shall be aware of the consequences.
If it is abused to scale of 2/3 total memory, it gains nothing but overheads.
benchmarks
==========
1) memory tight desktop
1.1) brief summary
- clock time and major faults are reduced by 50%;
- pswpin numbers are reduced to ~1/3.
That means X desktop responsiveness is doubled under high memory/swap pressure.
1.2) test scenario
- nfsroot gnome desktop with 512M physical memory
- run some programs, and switch between the existing windows
  after starting each new program.
1.3) progress timing (seconds)
  before       after    programs
    0.02        0.02    N xeyes
    0.75        0.76    N firefox
    2.02        1.88    N nautilus
    3.36        3.17    N nautilus --browser
    5.26        4.89    N gthumb
    7.12        6.47    N gedit
    9.22        8.16    N xpdf /usr/share/doc/shared-mime-info/shared-mime-info-spec.pdf
   13.58       12.55    N xterm
   15.87       14.57    N mlterm
   18.63       17.06    N gnome-terminal
   21.16       18.90    N urxvt
   26.24       23.48    N gnome-system-monitor
   28.72       26.52    N gnome-help
   32.15       29.65    N gnome-dictionary
   39.66       36.12    N /usr/games/sol
   43.16       39.27    N /usr/games/gnometris
   48.65       42.56    N /usr/games/gnect
   53.31       47.03    N /usr/games/gtali
   58.60       52.05    N /usr/games/iagno
   65.77       55.42    N /usr/games/gnotravex
   70.76       61.47    N /usr/games/mahjongg
   76.15       67.11    N /usr/games/gnome-sudoku
   86.32       75.15    N /usr/games/glines
   92.21       79.70    N /usr/games/glchess
  103.79       88.48    N /usr/games/gnomine
  113.84       96.51    N /usr/games/gnotski
  124.40      102.19    N /usr/games/gnibbles
  137.41      114.93    N /usr/games/gnobots2
  155.53      125.02    N /usr/games/blackjack
  179.85      135.11    N /usr/games/same-gnome
  224.49      154.50    N /usr/bin/gnome-window-properties
  248.44      162.09    N /usr/bin/gnome-default-applications-properties
  282.62      173.29    N /usr/bin/gnome-at-properties
  323.72      188.21    N /usr/bin/gnome-typing-monitor
  363.99      199.93    N /usr/bin/gnome-at-visual
  394.21      206.95    N /usr/bin/gnome-sound-properties
  435.14      224.49    N /usr/bin/gnome-at-mobility
  463.05      234.11    N /usr/bin/gnome-keybinding-properties
  503.75      248.59    N /usr/bin/gnome-about-me
  554.00      276.27    N /usr/bin/gnome-display-properties
  615.48      304.39    N /usr/bin/gnome-network-preferences
  693.03      342.01    N /usr/bin/gnome-mouse-properties
  759.90      388.58    N /usr/bin/gnome-appearance-properties
  937.90      508.47    N /usr/bin/gnome-control-center
 1109.75      587.57    N /usr/bin/gnome-keyboard-properties
 1399.05      758.16    N : oocalc
 1524.64      830.03    N : oodraw
 1684.31      900.03    N : ooimpress
 1874.04      993.91    N : oomath
 2115.12     1081.89    N : ooweb
 2369.02     1161.99    N : oowriter
Note that the last ": oo*" commands are actually commented out.
1.4) vmstat numbers (some relevant ones are marked with *)
                            before    after
 nr_free_pages              1293      3898
 nr_inactive_anon           59956     53460
 nr_active_anon             26815     30026
 nr_inactive_file           2657      3218
 nr_active_file             2019      2806
 nr_unevictable             4         4
 nr_mlock                   4         4
 nr_anon_pages              26706     27859
*nr_mapped                  3542      4469
 nr_file_pages              72232     67681
 nr_dirty                   1         0
 nr_writeback               123       19
 nr_slab_reclaimable        3375      3534
 nr_slab_unreclaimable      11405     10665
 nr_page_table_pages        8106      7864
 nr_unstable                0         0
 nr_bounce                  0         0
*nr_vmscan_write            394776    230839
 nr_writeback_temp          0         0
 numa_hit                   6843353   3318676
 numa_miss                  0         0
 numa_foreign               0         0
 numa_interleave            1719      1719
 numa_local                 6843353   3318676
 numa_other                 0         0
*pgpgin                     5954683   2057175
*pgpgout                    1578276   922744
*pswpin                     1486615   512238
*pswpout                    394568    230685
 pgalloc_dma                277432    56602
 pgalloc_dma32              6769477   3310348
 pgalloc_normal             0         0
 pgalloc_movable            0         0
 pgfree                     7048396   3371118
 pgactivate                 2036343   1471492
 pgdeactivate               2189691   1612829
 pgfault                    3702176   3100702
*pgmajfault                 452116    201343
 pgrefill_dma               12185     7127
 pgrefill_dma32             334384    653703
 pgrefill_normal            0         0
 pgrefill_movable           0         0
 pgsteal_dma                74214     22179
 pgsteal_dma32              3334164   1638029
 pgsteal_normal             0         0
 pgsteal_movable            0         0
 pgscan_kswapd_dma          1081421   1216199
 pgscan_kswapd_dma32        58979118  46002810
 pgscan_kswapd_normal       0         0
 pgscan_kswapd_movable      0         0
 pgscan_direct_dma          2015438   1086109
 pgscan_direct_dma32        55787823  36101597
 pgscan_direct_normal       0         0
 pgscan_direct_movable      0         0
 pginodesteal               3461      7281
 slabs_scanned              564864    527616
 kswapd_steal               2889797   1448082
 kswapd_inodesteal          14827     14835
 pageoutrun                 43459     21562
 allocstall                 9653      4032
 pgrotated                  384216    228631
1.5) free numbers at the end of the tests
before patch:
                             total       used       free     shared    buffers     cached
                Mem:           474        467          7          0          0        236
                -/+ buffers/cache:        230        243
                Swap:         1023        418        605
after patch:
                             total       used       free     shared    buffers     cached
                Mem:           474        457         16          0          0        236
                -/+ buffers/cache:        221        253
                Swap:         1023        404        619
2) memory flushing in a file server
2.1) brief summary
The number of major faults from 50 to 3 during 10% cache hot reads.
That means this patch successfully stops major faults when the active file
list is slowly scanned when there are partially cache hot streaming IO.
2.2) test scenario
Do 100000 pread(size=110 pages, offset=(i*100) pages), where 10% of the
pages will be activated:
        for i in `seq 0 100 10000000`; do echo $i 110;  done > pattern-hot-10
        iotrace.rb --load pattern-hot-10 --play /b/sparse
	vmmon  nr_mapped nr_active_file nr_inactive_file   pgmajfault pgdeactivate pgfree
and monitor /proc/vmstat during the time. The test box has 2G memory.
I carried out tests on fresh booted console as well as X desktop, and
fetched the vmstat numbers on
(1) begin:     shortly after the big read IO starts;
(2) end:       just before the big read IO stops;
(3) restore:   the big read IO stops and the zsh working set restored
(4) restore X: after IO, switch back and forth between the urxvt and firefox
               windows to restore their working set.
2.3) console mode results
        nr_mapped   nr_active_file nr_inactive_file       pgmajfault     pgdeactivate           pgfree
2.6.29 VM_EXEC protection ON:
begin:       2481             2237             8694              630                0           574299
end:          275           231976           233914              633           776271         20933042
restore:      370           232154           234524              691           777183         20958453
2.6.29 VM_EXEC protection ON (second run):
begin:       2434             2237             8493              629                0           574195
end:          284           231970           233536              632           771918         20896129
restore:      399           232218           234789              690           774526         20957909
2.6.30-rc4-mm VM_EXEC protection OFF:
begin:       2479             2344             9659              210                0           579643
end:          284           232010           234142              260           772776         20917184
restore:      379           232159           234371              301           774888         20967849
The above console numbers show that
- The startup pgmajfault of 2.6.30-rc4-mm is merely 1/3 that of 2.6.29.
  I'd attribute that improvement to the mmap readahead improvements :-)
- The pgmajfault increment during the file copy is 633-630=3 vs 260-210=50.
  That's a huge improvement - which means with the VM_EXEC protection logic,
  active mmap pages is pretty safe even under partially cache hot streaming IO.
- when active:inactive file lru size reaches 1:1, their scan rates is 1:20.8
  under 10% cache hot IO. (computed with formula Dpgdeactivate:Dpgfree)
  That roughly means the active mmap pages get 20.8 more chances to get
  re-referenced to stay in memory.
- The absolute nr_mapped drops considerably to 1/9 during the big IO, and the
  dropped pages are mostly inactive ones. The patch has almost no impact in
  this aspect, that means it won't unnecessarily increase memory pressure.
  (In contrast, your 20% mmap protection ratio will keep them all, and
  therefore eliminate the extra 41 major faults to restore working set
  of zsh etc.)
The iotrace.rb read throughput is
	151.194384MB/s 284.198252s 100001x 450560b --load pattern-hot-10 --play /b/sparse
which means the inactive list is rotated at the speed of 250MB/s,
so a full scan of which takes about 3.5 seconds, while a full scan
of active file list takes about 77 seconds.
2.4) X mode results
We can reach roughly the same conclusions for X desktop:
        nr_mapped   nr_active_file nr_inactive_file       pgmajfault     pgdeactivate           pgfree
2.6.30-rc4-mm VM_EXEC protection ON:
begin:       9740             8920            64075              561                0           678360
end:          768           218254           220029              565           798953         21057006
restore:      857           218543           220987              606           799462         21075710
restore X:   2414           218560           225344              797           799462         21080795
2.6.30-rc4-mm VM_EXEC protection OFF:
begin:       9368             5035            26389              554                0           633391
end:          770           218449           221230              661           646472         17832500
restore:     1113           218466           220978              710           649881         17905235
restore X:   2687           218650           225484              947           802700         21083584
- the absolute nr_mapped drops considerably (to 1/13 of the original size)
  during the streaming IO.
- the delta of pgmajfault is 3 vs 107 during IO, or 236 vs 393
  during the whole process.
Cc: Elladan <elladan@eskimo.com>
Cc: Nick Piggin <npiggin@suse.de>
Cc: Andi Kleen <andi@firstfloor.org>
Cc: Christoph Lameter <cl@linux-foundation.org>
Acked-by: Rik van Riel <riel@redhat.com>
Acked-by: Peter Zijlstra <peterz@infradead.org>
Acked-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Reviewed-by: Johannes Weiner <hannes@cmpxchg.org>
Reviewed-by: Minchan Kim <minchan.kim@gmail.com>
Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
											
										 
											2009-06-16 15:33:12 -07:00
										 |  |  | 				list_add(&page->lru, &l_active); | 
					
						
							|  |  |  | 				continue; | 
					
						
							|  |  |  | 			} | 
					
						
							|  |  |  | 		} | 
					
						
							| 
									
										
										
										
											2008-10-18 20:26:35 -07:00
										 |  |  | 
 | 
					
						
							| 
									
										
										
										
											2009-09-21 17:01:44 -07:00
										 |  |  | 		ClearPageActive(page);	/* we are de-activating */ | 
					
						
							| 
									
										
										
										
											2005-04-16 15:20:36 -07:00
										 |  |  | 		list_add(&page->lru, &l_inactive); | 
					
						
							|  |  |  | 	} | 
					
						
							|  |  |  | 
 | 
					
						
							| 
									
										
										
										
											2009-01-06 14:40:13 -08:00
										 |  |  | 	/*
 | 
					
						
							| 
									
										
											  
											
												vmscan: make mapped executable pages the first class citizen
Protect referenced PROT_EXEC mapped pages from being deactivated.
PROT_EXEC(or its internal presentation VM_EXEC) pages normally belong to some
currently running executables and their linked libraries, they shall really be
cached aggressively to provide good user experiences.
Thanks to Johannes Weiner for the advice to reuse the VMA walk in
page_referenced() to get the PROT_EXEC bit.
[more details]
( The consequences of this patch will have to be discussed together with
  Rik van Riel's recent patch "vmscan: evict use-once pages first". )
( Some of the good points and insights are taken into this changelog.
  Thanks to all the involved people for the great LKML discussions. )
the problem
===========
For a typical desktop, the most precious working set is composed of
*actively accessed*
	(1) memory mapped executables
	(2) and their anonymous pages
	(3) and other files
	(4) and the dcache/icache/.. slabs
while the least important data are
	(5) infrequently used or use-once files
For a typical desktop, one major problem is busty and large amount of (5)
use-once files flushing out the working set.
Inside the working set, (4) dcache/icache have already been too sticky ;-)
So we only have to care (2) anonymous and (1)(3) file pages.
anonymous pages
===============
Anonymous pages are effectively immune to the streaming IO attack, because we
now have separate file/anon LRU lists. When the use-once files crowd into the
file LRU, the list's "quality" is significantly lowered. Therefore the scan
balance policy in get_scan_ratio() will choose to scan the (low quality) file
LRU much more frequently than the anon LRU.
file pages
==========
Rik proposed to *not* scan the active file LRU when the inactive list grows
larger than active list. This guarantees that when there are use-once streaming
IO, and the working set is not too large(so that active_size < inactive_size),
the active file LRU will *not* be scanned at all. So the not-too-large working
set can be well protected.
But there are also situations where the file working set is a bit large so that
(active_size >= inactive_size), or the streaming IOs are not purely use-once.
In these cases, the active list will be scanned slowly. Because the current
shrink_active_list() policy is to deactivate active pages regardless of their
referenced bits. The deactivated pages become susceptible to the streaming IO
attack: the inactive list could be scanned fast (500MB / 50MBps = 10s) so that
the deactivated pages don't have enough time to get re-referenced. Because a
user tend to switch between windows in intervals from seconds to minutes.
This patch holds mapped executable pages in the active list as long as they
are referenced during each full scan of the active list.  Because the active
list is normally scanned much slower, they get longer grace time (eg. 100s)
for further references, which better matches the pace of user operations.
Therefore this patch greatly prolongs the in-cache time of executable code,
when there are moderate memory pressures.
	before patch: guaranteed to be cached if reference intervals < I
	after  patch: guaranteed to be cached if reference intervals < I+A
		      (except when randomly reclaimed by the lumpy reclaim)
where
	A = time to fully scan the   active file LRU
	I = time to fully scan the inactive file LRU
Note that normally A >> I.
side effects
============
This patch is safe in general, it restores the pre-2.6.28 mmap() behavior
but in a much smaller and well targeted scope.
One may worry about some one to abuse the PROT_EXEC heuristic.  But as
Andrew Morton stated, there are other tricks to getting that sort of boost.
Another concern is the PROT_EXEC mapped pages growing large in rare cases,
and therefore hurting reclaim efficiency. But a sane application targeted for
large audience will never use PROT_EXEC for data mappings. If some home made
application tries to abuse that bit, it shall be aware of the consequences.
If it is abused to scale of 2/3 total memory, it gains nothing but overheads.
benchmarks
==========
1) memory tight desktop
1.1) brief summary
- clock time and major faults are reduced by 50%;
- pswpin numbers are reduced to ~1/3.
That means X desktop responsiveness is doubled under high memory/swap pressure.
1.2) test scenario
- nfsroot gnome desktop with 512M physical memory
- run some programs, and switch between the existing windows
  after starting each new program.
1.3) progress timing (seconds)
  before       after    programs
    0.02        0.02    N xeyes
    0.75        0.76    N firefox
    2.02        1.88    N nautilus
    3.36        3.17    N nautilus --browser
    5.26        4.89    N gthumb
    7.12        6.47    N gedit
    9.22        8.16    N xpdf /usr/share/doc/shared-mime-info/shared-mime-info-spec.pdf
   13.58       12.55    N xterm
   15.87       14.57    N mlterm
   18.63       17.06    N gnome-terminal
   21.16       18.90    N urxvt
   26.24       23.48    N gnome-system-monitor
   28.72       26.52    N gnome-help
   32.15       29.65    N gnome-dictionary
   39.66       36.12    N /usr/games/sol
   43.16       39.27    N /usr/games/gnometris
   48.65       42.56    N /usr/games/gnect
   53.31       47.03    N /usr/games/gtali
   58.60       52.05    N /usr/games/iagno
   65.77       55.42    N /usr/games/gnotravex
   70.76       61.47    N /usr/games/mahjongg
   76.15       67.11    N /usr/games/gnome-sudoku
   86.32       75.15    N /usr/games/glines
   92.21       79.70    N /usr/games/glchess
  103.79       88.48    N /usr/games/gnomine
  113.84       96.51    N /usr/games/gnotski
  124.40      102.19    N /usr/games/gnibbles
  137.41      114.93    N /usr/games/gnobots2
  155.53      125.02    N /usr/games/blackjack
  179.85      135.11    N /usr/games/same-gnome
  224.49      154.50    N /usr/bin/gnome-window-properties
  248.44      162.09    N /usr/bin/gnome-default-applications-properties
  282.62      173.29    N /usr/bin/gnome-at-properties
  323.72      188.21    N /usr/bin/gnome-typing-monitor
  363.99      199.93    N /usr/bin/gnome-at-visual
  394.21      206.95    N /usr/bin/gnome-sound-properties
  435.14      224.49    N /usr/bin/gnome-at-mobility
  463.05      234.11    N /usr/bin/gnome-keybinding-properties
  503.75      248.59    N /usr/bin/gnome-about-me
  554.00      276.27    N /usr/bin/gnome-display-properties
  615.48      304.39    N /usr/bin/gnome-network-preferences
  693.03      342.01    N /usr/bin/gnome-mouse-properties
  759.90      388.58    N /usr/bin/gnome-appearance-properties
  937.90      508.47    N /usr/bin/gnome-control-center
 1109.75      587.57    N /usr/bin/gnome-keyboard-properties
 1399.05      758.16    N : oocalc
 1524.64      830.03    N : oodraw
 1684.31      900.03    N : ooimpress
 1874.04      993.91    N : oomath
 2115.12     1081.89    N : ooweb
 2369.02     1161.99    N : oowriter
Note that the last ": oo*" commands are actually commented out.
1.4) vmstat numbers (some relevant ones are marked with *)
                            before    after
 nr_free_pages              1293      3898
 nr_inactive_anon           59956     53460
 nr_active_anon             26815     30026
 nr_inactive_file           2657      3218
 nr_active_file             2019      2806
 nr_unevictable             4         4
 nr_mlock                   4         4
 nr_anon_pages              26706     27859
*nr_mapped                  3542      4469
 nr_file_pages              72232     67681
 nr_dirty                   1         0
 nr_writeback               123       19
 nr_slab_reclaimable        3375      3534
 nr_slab_unreclaimable      11405     10665
 nr_page_table_pages        8106      7864
 nr_unstable                0         0
 nr_bounce                  0         0
*nr_vmscan_write            394776    230839
 nr_writeback_temp          0         0
 numa_hit                   6843353   3318676
 numa_miss                  0         0
 numa_foreign               0         0
 numa_interleave            1719      1719
 numa_local                 6843353   3318676
 numa_other                 0         0
*pgpgin                     5954683   2057175
*pgpgout                    1578276   922744
*pswpin                     1486615   512238
*pswpout                    394568    230685
 pgalloc_dma                277432    56602
 pgalloc_dma32              6769477   3310348
 pgalloc_normal             0         0
 pgalloc_movable            0         0
 pgfree                     7048396   3371118
 pgactivate                 2036343   1471492
 pgdeactivate               2189691   1612829
 pgfault                    3702176   3100702
*pgmajfault                 452116    201343
 pgrefill_dma               12185     7127
 pgrefill_dma32             334384    653703
 pgrefill_normal            0         0
 pgrefill_movable           0         0
 pgsteal_dma                74214     22179
 pgsteal_dma32              3334164   1638029
 pgsteal_normal             0         0
 pgsteal_movable            0         0
 pgscan_kswapd_dma          1081421   1216199
 pgscan_kswapd_dma32        58979118  46002810
 pgscan_kswapd_normal       0         0
 pgscan_kswapd_movable      0         0
 pgscan_direct_dma          2015438   1086109
 pgscan_direct_dma32        55787823  36101597
 pgscan_direct_normal       0         0
 pgscan_direct_movable      0         0
 pginodesteal               3461      7281
 slabs_scanned              564864    527616
 kswapd_steal               2889797   1448082
 kswapd_inodesteal          14827     14835
 pageoutrun                 43459     21562
 allocstall                 9653      4032
 pgrotated                  384216    228631
1.5) free numbers at the end of the tests
before patch:
                             total       used       free     shared    buffers     cached
                Mem:           474        467          7          0          0        236
                -/+ buffers/cache:        230        243
                Swap:         1023        418        605
after patch:
                             total       used       free     shared    buffers     cached
                Mem:           474        457         16          0          0        236
                -/+ buffers/cache:        221        253
                Swap:         1023        404        619
2) memory flushing in a file server
2.1) brief summary
The number of major faults from 50 to 3 during 10% cache hot reads.
That means this patch successfully stops major faults when the active file
list is slowly scanned when there are partially cache hot streaming IO.
2.2) test scenario
Do 100000 pread(size=110 pages, offset=(i*100) pages), where 10% of the
pages will be activated:
        for i in `seq 0 100 10000000`; do echo $i 110;  done > pattern-hot-10
        iotrace.rb --load pattern-hot-10 --play /b/sparse
	vmmon  nr_mapped nr_active_file nr_inactive_file   pgmajfault pgdeactivate pgfree
and monitor /proc/vmstat during the time. The test box has 2G memory.
I carried out tests on fresh booted console as well as X desktop, and
fetched the vmstat numbers on
(1) begin:     shortly after the big read IO starts;
(2) end:       just before the big read IO stops;
(3) restore:   the big read IO stops and the zsh working set restored
(4) restore X: after IO, switch back and forth between the urxvt and firefox
               windows to restore their working set.
2.3) console mode results
        nr_mapped   nr_active_file nr_inactive_file       pgmajfault     pgdeactivate           pgfree
2.6.29 VM_EXEC protection ON:
begin:       2481             2237             8694              630                0           574299
end:          275           231976           233914              633           776271         20933042
restore:      370           232154           234524              691           777183         20958453
2.6.29 VM_EXEC protection ON (second run):
begin:       2434             2237             8493              629                0           574195
end:          284           231970           233536              632           771918         20896129
restore:      399           232218           234789              690           774526         20957909
2.6.30-rc4-mm VM_EXEC protection OFF:
begin:       2479             2344             9659              210                0           579643
end:          284           232010           234142              260           772776         20917184
restore:      379           232159           234371              301           774888         20967849
The above console numbers show that
- The startup pgmajfault of 2.6.30-rc4-mm is merely 1/3 that of 2.6.29.
  I'd attribute that improvement to the mmap readahead improvements :-)
- The pgmajfault increment during the file copy is 633-630=3 vs 260-210=50.
  That's a huge improvement - which means with the VM_EXEC protection logic,
  active mmap pages is pretty safe even under partially cache hot streaming IO.
- when active:inactive file lru size reaches 1:1, their scan rates is 1:20.8
  under 10% cache hot IO. (computed with formula Dpgdeactivate:Dpgfree)
  That roughly means the active mmap pages get 20.8 more chances to get
  re-referenced to stay in memory.
- The absolute nr_mapped drops considerably to 1/9 during the big IO, and the
  dropped pages are mostly inactive ones. The patch has almost no impact in
  this aspect, that means it won't unnecessarily increase memory pressure.
  (In contrast, your 20% mmap protection ratio will keep them all, and
  therefore eliminate the extra 41 major faults to restore working set
  of zsh etc.)
The iotrace.rb read throughput is
	151.194384MB/s 284.198252s 100001x 450560b --load pattern-hot-10 --play /b/sparse
which means the inactive list is rotated at the speed of 250MB/s,
so a full scan of which takes about 3.5 seconds, while a full scan
of active file list takes about 77 seconds.
2.4) X mode results
We can reach roughly the same conclusions for X desktop:
        nr_mapped   nr_active_file nr_inactive_file       pgmajfault     pgdeactivate           pgfree
2.6.30-rc4-mm VM_EXEC protection ON:
begin:       9740             8920            64075              561                0           678360
end:          768           218254           220029              565           798953         21057006
restore:      857           218543           220987              606           799462         21075710
restore X:   2414           218560           225344              797           799462         21080795
2.6.30-rc4-mm VM_EXEC protection OFF:
begin:       9368             5035            26389              554                0           633391
end:          770           218449           221230              661           646472         17832500
restore:     1113           218466           220978              710           649881         17905235
restore X:   2687           218650           225484              947           802700         21083584
- the absolute nr_mapped drops considerably (to 1/13 of the original size)
  during the streaming IO.
- the delta of pgmajfault is 3 vs 107 during IO, or 236 vs 393
  during the whole process.
Cc: Elladan <elladan@eskimo.com>
Cc: Nick Piggin <npiggin@suse.de>
Cc: Andi Kleen <andi@firstfloor.org>
Cc: Christoph Lameter <cl@linux-foundation.org>
Acked-by: Rik van Riel <riel@redhat.com>
Acked-by: Peter Zijlstra <peterz@infradead.org>
Acked-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Reviewed-by: Johannes Weiner <hannes@cmpxchg.org>
Reviewed-by: Minchan Kim <minchan.kim@gmail.com>
Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
											
										 
											2009-06-16 15:33:12 -07:00
										 |  |  | 	 * Move pages back to the lru list. | 
					
						
							| 
									
										
										
										
											2009-01-06 14:40:13 -08:00
										 |  |  | 	 */ | 
					
						
							| 
									
										
										
										
											2008-12-01 03:00:35 +01:00
										 |  |  | 	spin_lock_irq(&zone->lru_lock); | 
					
						
							| 
									
										
										
										
											2008-10-18 20:26:34 -07:00
										 |  |  | 	/*
 | 
					
						
							| 
									
										
											  
											
												vmscan: make mapped executable pages the first class citizen
Protect referenced PROT_EXEC mapped pages from being deactivated.
PROT_EXEC(or its internal presentation VM_EXEC) pages normally belong to some
currently running executables and their linked libraries, they shall really be
cached aggressively to provide good user experiences.
Thanks to Johannes Weiner for the advice to reuse the VMA walk in
page_referenced() to get the PROT_EXEC bit.
[more details]
( The consequences of this patch will have to be discussed together with
  Rik van Riel's recent patch "vmscan: evict use-once pages first". )
( Some of the good points and insights are taken into this changelog.
  Thanks to all the involved people for the great LKML discussions. )
the problem
===========
For a typical desktop, the most precious working set is composed of
*actively accessed*
	(1) memory mapped executables
	(2) and their anonymous pages
	(3) and other files
	(4) and the dcache/icache/.. slabs
while the least important data are
	(5) infrequently used or use-once files
For a typical desktop, one major problem is busty and large amount of (5)
use-once files flushing out the working set.
Inside the working set, (4) dcache/icache have already been too sticky ;-)
So we only have to care (2) anonymous and (1)(3) file pages.
anonymous pages
===============
Anonymous pages are effectively immune to the streaming IO attack, because we
now have separate file/anon LRU lists. When the use-once files crowd into the
file LRU, the list's "quality" is significantly lowered. Therefore the scan
balance policy in get_scan_ratio() will choose to scan the (low quality) file
LRU much more frequently than the anon LRU.
file pages
==========
Rik proposed to *not* scan the active file LRU when the inactive list grows
larger than active list. This guarantees that when there are use-once streaming
IO, and the working set is not too large(so that active_size < inactive_size),
the active file LRU will *not* be scanned at all. So the not-too-large working
set can be well protected.
But there are also situations where the file working set is a bit large so that
(active_size >= inactive_size), or the streaming IOs are not purely use-once.
In these cases, the active list will be scanned slowly. Because the current
shrink_active_list() policy is to deactivate active pages regardless of their
referenced bits. The deactivated pages become susceptible to the streaming IO
attack: the inactive list could be scanned fast (500MB / 50MBps = 10s) so that
the deactivated pages don't have enough time to get re-referenced. Because a
user tend to switch between windows in intervals from seconds to minutes.
This patch holds mapped executable pages in the active list as long as they
are referenced during each full scan of the active list.  Because the active
list is normally scanned much slower, they get longer grace time (eg. 100s)
for further references, which better matches the pace of user operations.
Therefore this patch greatly prolongs the in-cache time of executable code,
when there are moderate memory pressures.
	before patch: guaranteed to be cached if reference intervals < I
	after  patch: guaranteed to be cached if reference intervals < I+A
		      (except when randomly reclaimed by the lumpy reclaim)
where
	A = time to fully scan the   active file LRU
	I = time to fully scan the inactive file LRU
Note that normally A >> I.
side effects
============
This patch is safe in general, it restores the pre-2.6.28 mmap() behavior
but in a much smaller and well targeted scope.
One may worry about some one to abuse the PROT_EXEC heuristic.  But as
Andrew Morton stated, there are other tricks to getting that sort of boost.
Another concern is the PROT_EXEC mapped pages growing large in rare cases,
and therefore hurting reclaim efficiency. But a sane application targeted for
large audience will never use PROT_EXEC for data mappings. If some home made
application tries to abuse that bit, it shall be aware of the consequences.
If it is abused to scale of 2/3 total memory, it gains nothing but overheads.
benchmarks
==========
1) memory tight desktop
1.1) brief summary
- clock time and major faults are reduced by 50%;
- pswpin numbers are reduced to ~1/3.
That means X desktop responsiveness is doubled under high memory/swap pressure.
1.2) test scenario
- nfsroot gnome desktop with 512M physical memory
- run some programs, and switch between the existing windows
  after starting each new program.
1.3) progress timing (seconds)
  before       after    programs
    0.02        0.02    N xeyes
    0.75        0.76    N firefox
    2.02        1.88    N nautilus
    3.36        3.17    N nautilus --browser
    5.26        4.89    N gthumb
    7.12        6.47    N gedit
    9.22        8.16    N xpdf /usr/share/doc/shared-mime-info/shared-mime-info-spec.pdf
   13.58       12.55    N xterm
   15.87       14.57    N mlterm
   18.63       17.06    N gnome-terminal
   21.16       18.90    N urxvt
   26.24       23.48    N gnome-system-monitor
   28.72       26.52    N gnome-help
   32.15       29.65    N gnome-dictionary
   39.66       36.12    N /usr/games/sol
   43.16       39.27    N /usr/games/gnometris
   48.65       42.56    N /usr/games/gnect
   53.31       47.03    N /usr/games/gtali
   58.60       52.05    N /usr/games/iagno
   65.77       55.42    N /usr/games/gnotravex
   70.76       61.47    N /usr/games/mahjongg
   76.15       67.11    N /usr/games/gnome-sudoku
   86.32       75.15    N /usr/games/glines
   92.21       79.70    N /usr/games/glchess
  103.79       88.48    N /usr/games/gnomine
  113.84       96.51    N /usr/games/gnotski
  124.40      102.19    N /usr/games/gnibbles
  137.41      114.93    N /usr/games/gnobots2
  155.53      125.02    N /usr/games/blackjack
  179.85      135.11    N /usr/games/same-gnome
  224.49      154.50    N /usr/bin/gnome-window-properties
  248.44      162.09    N /usr/bin/gnome-default-applications-properties
  282.62      173.29    N /usr/bin/gnome-at-properties
  323.72      188.21    N /usr/bin/gnome-typing-monitor
  363.99      199.93    N /usr/bin/gnome-at-visual
  394.21      206.95    N /usr/bin/gnome-sound-properties
  435.14      224.49    N /usr/bin/gnome-at-mobility
  463.05      234.11    N /usr/bin/gnome-keybinding-properties
  503.75      248.59    N /usr/bin/gnome-about-me
  554.00      276.27    N /usr/bin/gnome-display-properties
  615.48      304.39    N /usr/bin/gnome-network-preferences
  693.03      342.01    N /usr/bin/gnome-mouse-properties
  759.90      388.58    N /usr/bin/gnome-appearance-properties
  937.90      508.47    N /usr/bin/gnome-control-center
 1109.75      587.57    N /usr/bin/gnome-keyboard-properties
 1399.05      758.16    N : oocalc
 1524.64      830.03    N : oodraw
 1684.31      900.03    N : ooimpress
 1874.04      993.91    N : oomath
 2115.12     1081.89    N : ooweb
 2369.02     1161.99    N : oowriter
Note that the last ": oo*" commands are actually commented out.
1.4) vmstat numbers (some relevant ones are marked with *)
                            before    after
 nr_free_pages              1293      3898
 nr_inactive_anon           59956     53460
 nr_active_anon             26815     30026
 nr_inactive_file           2657      3218
 nr_active_file             2019      2806
 nr_unevictable             4         4
 nr_mlock                   4         4
 nr_anon_pages              26706     27859
*nr_mapped                  3542      4469
 nr_file_pages              72232     67681
 nr_dirty                   1         0
 nr_writeback               123       19
 nr_slab_reclaimable        3375      3534
 nr_slab_unreclaimable      11405     10665
 nr_page_table_pages        8106      7864
 nr_unstable                0         0
 nr_bounce                  0         0
*nr_vmscan_write            394776    230839
 nr_writeback_temp          0         0
 numa_hit                   6843353   3318676
 numa_miss                  0         0
 numa_foreign               0         0
 numa_interleave            1719      1719
 numa_local                 6843353   3318676
 numa_other                 0         0
*pgpgin                     5954683   2057175
*pgpgout                    1578276   922744
*pswpin                     1486615   512238
*pswpout                    394568    230685
 pgalloc_dma                277432    56602
 pgalloc_dma32              6769477   3310348
 pgalloc_normal             0         0
 pgalloc_movable            0         0
 pgfree                     7048396   3371118
 pgactivate                 2036343   1471492
 pgdeactivate               2189691   1612829
 pgfault                    3702176   3100702
*pgmajfault                 452116    201343
 pgrefill_dma               12185     7127
 pgrefill_dma32             334384    653703
 pgrefill_normal            0         0
 pgrefill_movable           0         0
 pgsteal_dma                74214     22179
 pgsteal_dma32              3334164   1638029
 pgsteal_normal             0         0
 pgsteal_movable            0         0
 pgscan_kswapd_dma          1081421   1216199
 pgscan_kswapd_dma32        58979118  46002810
 pgscan_kswapd_normal       0         0
 pgscan_kswapd_movable      0         0
 pgscan_direct_dma          2015438   1086109
 pgscan_direct_dma32        55787823  36101597
 pgscan_direct_normal       0         0
 pgscan_direct_movable      0         0
 pginodesteal               3461      7281
 slabs_scanned              564864    527616
 kswapd_steal               2889797   1448082
 kswapd_inodesteal          14827     14835
 pageoutrun                 43459     21562
 allocstall                 9653      4032
 pgrotated                  384216    228631
1.5) free numbers at the end of the tests
before patch:
                             total       used       free     shared    buffers     cached
                Mem:           474        467          7          0          0        236
                -/+ buffers/cache:        230        243
                Swap:         1023        418        605
after patch:
                             total       used       free     shared    buffers     cached
                Mem:           474        457         16          0          0        236
                -/+ buffers/cache:        221        253
                Swap:         1023        404        619
2) memory flushing in a file server
2.1) brief summary
The number of major faults from 50 to 3 during 10% cache hot reads.
That means this patch successfully stops major faults when the active file
list is slowly scanned when there are partially cache hot streaming IO.
2.2) test scenario
Do 100000 pread(size=110 pages, offset=(i*100) pages), where 10% of the
pages will be activated:
        for i in `seq 0 100 10000000`; do echo $i 110;  done > pattern-hot-10
        iotrace.rb --load pattern-hot-10 --play /b/sparse
	vmmon  nr_mapped nr_active_file nr_inactive_file   pgmajfault pgdeactivate pgfree
and monitor /proc/vmstat during the time. The test box has 2G memory.
I carried out tests on fresh booted console as well as X desktop, and
fetched the vmstat numbers on
(1) begin:     shortly after the big read IO starts;
(2) end:       just before the big read IO stops;
(3) restore:   the big read IO stops and the zsh working set restored
(4) restore X: after IO, switch back and forth between the urxvt and firefox
               windows to restore their working set.
2.3) console mode results
        nr_mapped   nr_active_file nr_inactive_file       pgmajfault     pgdeactivate           pgfree
2.6.29 VM_EXEC protection ON:
begin:       2481             2237             8694              630                0           574299
end:          275           231976           233914              633           776271         20933042
restore:      370           232154           234524              691           777183         20958453
2.6.29 VM_EXEC protection ON (second run):
begin:       2434             2237             8493              629                0           574195
end:          284           231970           233536              632           771918         20896129
restore:      399           232218           234789              690           774526         20957909
2.6.30-rc4-mm VM_EXEC protection OFF:
begin:       2479             2344             9659              210                0           579643
end:          284           232010           234142              260           772776         20917184
restore:      379           232159           234371              301           774888         20967849
The above console numbers show that
- The startup pgmajfault of 2.6.30-rc4-mm is merely 1/3 that of 2.6.29.
  I'd attribute that improvement to the mmap readahead improvements :-)
- The pgmajfault increment during the file copy is 633-630=3 vs 260-210=50.
  That's a huge improvement - which means with the VM_EXEC protection logic,
  active mmap pages is pretty safe even under partially cache hot streaming IO.
- when active:inactive file lru size reaches 1:1, their scan rates is 1:20.8
  under 10% cache hot IO. (computed with formula Dpgdeactivate:Dpgfree)
  That roughly means the active mmap pages get 20.8 more chances to get
  re-referenced to stay in memory.
- The absolute nr_mapped drops considerably to 1/9 during the big IO, and the
  dropped pages are mostly inactive ones. The patch has almost no impact in
  this aspect, that means it won't unnecessarily increase memory pressure.
  (In contrast, your 20% mmap protection ratio will keep them all, and
  therefore eliminate the extra 41 major faults to restore working set
  of zsh etc.)
The iotrace.rb read throughput is
	151.194384MB/s 284.198252s 100001x 450560b --load pattern-hot-10 --play /b/sparse
which means the inactive list is rotated at the speed of 250MB/s,
so a full scan of which takes about 3.5 seconds, while a full scan
of active file list takes about 77 seconds.
2.4) X mode results
We can reach roughly the same conclusions for X desktop:
        nr_mapped   nr_active_file nr_inactive_file       pgmajfault     pgdeactivate           pgfree
2.6.30-rc4-mm VM_EXEC protection ON:
begin:       9740             8920            64075              561                0           678360
end:          768           218254           220029              565           798953         21057006
restore:      857           218543           220987              606           799462         21075710
restore X:   2414           218560           225344              797           799462         21080795
2.6.30-rc4-mm VM_EXEC protection OFF:
begin:       9368             5035            26389              554                0           633391
end:          770           218449           221230              661           646472         17832500
restore:     1113           218466           220978              710           649881         17905235
restore X:   2687           218650           225484              947           802700         21083584
- the absolute nr_mapped drops considerably (to 1/13 of the original size)
  during the streaming IO.
- the delta of pgmajfault is 3 vs 107 during IO, or 236 vs 393
  during the whole process.
Cc: Elladan <elladan@eskimo.com>
Cc: Nick Piggin <npiggin@suse.de>
Cc: Andi Kleen <andi@firstfloor.org>
Cc: Christoph Lameter <cl@linux-foundation.org>
Acked-by: Rik van Riel <riel@redhat.com>
Acked-by: Peter Zijlstra <peterz@infradead.org>
Acked-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Reviewed-by: Johannes Weiner <hannes@cmpxchg.org>
Reviewed-by: Minchan Kim <minchan.kim@gmail.com>
Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
											
										 
											2009-06-16 15:33:12 -07:00
										 |  |  | 	 * Count referenced pages from currently used mappings as rotated, | 
					
						
							|  |  |  | 	 * even though only some of them are actually re-activated.  This | 
					
						
							|  |  |  | 	 * helps balance scan pressure between file and anonymous pages in | 
					
						
							|  |  |  | 	 * get_scan_ratio. | 
					
						
							| 
									
										
										
										
											2008-10-18 20:26:35 -07:00
										 |  |  | 	 */ | 
					
						
							| 
									
										
										
										
											2009-09-21 17:02:56 -07:00
										 |  |  | 	reclaim_stat->recent_rotated[file] += nr_rotated; | 
					
						
							| 
									
										
										
										
											2008-10-18 20:26:34 -07:00
										 |  |  | 
 | 
					
						
							| 
									
										
										
										
											2012-05-29 15:07:09 -07:00
										 |  |  | 	move_active_pages_to_lru(lruvec, &l_active, &l_hold, lru); | 
					
						
							|  |  |  | 	move_active_pages_to_lru(lruvec, &l_inactive, &l_hold, lru - LRU_ACTIVE); | 
					
						
							| 
									
										
										
										
											2009-09-21 17:01:37 -07:00
										 |  |  | 	__mod_zone_page_state(zone, NR_ISOLATED_ANON + file, -nr_taken); | 
					
						
							| 
									
										
										
										
											2006-06-30 01:55:45 -07:00
										 |  |  | 	spin_unlock_irq(&zone->lru_lock); | 
					
						
							| 
									
										
										
										
											2012-01-12 17:19:56 -08:00
										 |  |  | 
 | 
					
						
							|  |  |  | 	free_hot_cold_page_list(&l_hold, 1); | 
					
						
							| 
									
										
										
										
											2005-04-16 15:20:36 -07:00
										 |  |  | } | 
					
						
							|  |  |  | 
 | 
					
						
							| 
									
										
										
										
											2010-10-26 14:21:31 -07:00
										 |  |  | #ifdef CONFIG_SWAP
 | 
					
						
							| 
									
										
										
										
											2009-01-07 18:08:18 -08:00
										 |  |  | static int inactive_anon_is_low_global(struct zone *zone) | 
					
						
							| 
									
										
										
										
											2009-01-07 18:08:14 -08:00
										 |  |  | { | 
					
						
							|  |  |  | 	unsigned long active, inactive; | 
					
						
							|  |  |  | 
 | 
					
						
							|  |  |  | 	active = zone_page_state(zone, NR_ACTIVE_ANON); | 
					
						
							|  |  |  | 	inactive = zone_page_state(zone, NR_INACTIVE_ANON); | 
					
						
							|  |  |  | 
 | 
					
						
							|  |  |  | 	if (inactive * zone->inactive_ratio < active) | 
					
						
							|  |  |  | 		return 1; | 
					
						
							|  |  |  | 
 | 
					
						
							|  |  |  | 	return 0; | 
					
						
							|  |  |  | } | 
					
						
							|  |  |  | 
 | 
					
						
							| 
									
										
										
										
											2009-01-07 18:08:18 -08:00
										 |  |  | /**
 | 
					
						
							|  |  |  |  * inactive_anon_is_low - check if anonymous pages need to be deactivated | 
					
						
							| 
									
										
										
										
											2012-05-29 15:07:00 -07:00
										 |  |  |  * @lruvec: LRU vector to check | 
					
						
							| 
									
										
										
										
											2009-01-07 18:08:18 -08:00
										 |  |  |  * | 
					
						
							|  |  |  |  * Returns true if the zone does not have enough inactive anon pages, | 
					
						
							|  |  |  |  * meaning some active anon pages need to be deactivated. | 
					
						
							|  |  |  |  */ | 
					
						
							| 
									
										
										
										
											2012-05-29 15:07:00 -07:00
										 |  |  | static int inactive_anon_is_low(struct lruvec *lruvec) | 
					
						
							| 
									
										
										
										
											2009-01-07 18:08:18 -08:00
										 |  |  | { | 
					
						
							| 
									
										
										
										
											2010-10-26 14:21:31 -07:00
										 |  |  | 	/*
 | 
					
						
							|  |  |  | 	 * If we don't have swap space, anonymous page deactivation | 
					
						
							|  |  |  | 	 * is pointless. | 
					
						
							|  |  |  | 	 */ | 
					
						
							|  |  |  | 	if (!total_swap_pages) | 
					
						
							|  |  |  | 		return 0; | 
					
						
							|  |  |  | 
 | 
					
						
							| 
									
										
										
										
											2012-05-29 15:06:52 -07:00
										 |  |  | 	if (!mem_cgroup_disabled()) | 
					
						
							| 
									
										
										
										
											2012-05-29 15:07:00 -07:00
										 |  |  | 		return mem_cgroup_inactive_anon_is_low(lruvec); | 
					
						
							| 
									
										
										
										
											2012-01-12 17:17:52 -08:00
										 |  |  | 
 | 
					
						
							| 
									
										
										
										
											2012-05-29 15:07:00 -07:00
										 |  |  | 	return inactive_anon_is_low_global(lruvec_zone(lruvec)); | 
					
						
							| 
									
										
										
										
											2009-01-07 18:08:18 -08:00
										 |  |  | } | 
					
						
							| 
									
										
										
										
											2010-10-26 14:21:31 -07:00
										 |  |  | #else
 | 
					
						
							| 
									
										
										
										
											2012-05-29 15:07:00 -07:00
										 |  |  | static inline int inactive_anon_is_low(struct lruvec *lruvec) | 
					
						
							| 
									
										
										
										
											2010-10-26 14:21:31 -07:00
										 |  |  | { | 
					
						
							|  |  |  | 	return 0; | 
					
						
							|  |  |  | } | 
					
						
							|  |  |  | #endif
 | 
					
						
							| 
									
										
										
										
											2009-01-07 18:08:18 -08:00
										 |  |  | 
 | 
					
						
							| 
									
										
										
										
											2009-06-16 15:32:28 -07:00
										 |  |  | /**
 | 
					
						
							|  |  |  |  * inactive_file_is_low - check if file pages need to be deactivated | 
					
						
							| 
									
										
										
										
											2012-05-29 15:07:00 -07:00
										 |  |  |  * @lruvec: LRU vector to check | 
					
						
							| 
									
										
										
										
											2009-06-16 15:32:28 -07:00
										 |  |  |  * | 
					
						
							|  |  |  |  * When the system is doing streaming IO, memory pressure here | 
					
						
							|  |  |  |  * ensures that active file pages get deactivated, until more | 
					
						
							|  |  |  |  * than half of the file pages are on the inactive list. | 
					
						
							|  |  |  |  * | 
					
						
							|  |  |  |  * Once we get to that situation, protect the system's working | 
					
						
							|  |  |  |  * set from being evicted by disabling active file page aging. | 
					
						
							|  |  |  |  * | 
					
						
							|  |  |  |  * This uses a different ratio than the anonymous pages, because | 
					
						
							|  |  |  |  * the page cache uses a use-once replacement algorithm. | 
					
						
							|  |  |  |  */ | 
					
						
							| 
									
										
										
										
											2012-05-29 15:07:00 -07:00
										 |  |  | static int inactive_file_is_low(struct lruvec *lruvec) | 
					
						
							| 
									
										
										
										
											2009-06-16 15:32:28 -07:00
										 |  |  | { | 
					
						
							| 
									
										
										
										
											2013-02-22 16:35:19 -08:00
										 |  |  | 	unsigned long inactive; | 
					
						
							|  |  |  | 	unsigned long active; | 
					
						
							|  |  |  | 
 | 
					
						
							|  |  |  | 	inactive = get_lru_size(lruvec, LRU_INACTIVE_FILE); | 
					
						
							|  |  |  | 	active = get_lru_size(lruvec, LRU_ACTIVE_FILE); | 
					
						
							| 
									
										
										
										
											2009-06-16 15:32:28 -07:00
										 |  |  | 
 | 
					
						
							| 
									
										
										
										
											2013-02-22 16:35:19 -08:00
										 |  |  | 	return active > inactive; | 
					
						
							| 
									
										
										
										
											2009-06-16 15:32:28 -07:00
										 |  |  | } | 
					
						
							|  |  |  | 
 | 
					
						
							| 
									
										
										
										
											2012-05-29 15:07:09 -07:00
										 |  |  | static int inactive_list_is_low(struct lruvec *lruvec, enum lru_list lru) | 
					
						
							| 
									
										
										
										
											2009-12-14 17:59:48 -08:00
										 |  |  | { | 
					
						
							| 
									
										
										
										
											2012-05-29 15:07:09 -07:00
										 |  |  | 	if (is_file_lru(lru)) | 
					
						
							| 
									
										
										
										
											2012-05-29 15:07:00 -07:00
										 |  |  | 		return inactive_file_is_low(lruvec); | 
					
						
							| 
									
										
										
										
											2009-12-14 17:59:48 -08:00
										 |  |  | 	else | 
					
						
							| 
									
										
										
										
											2012-05-29 15:07:00 -07:00
										 |  |  | 		return inactive_anon_is_low(lruvec); | 
					
						
							| 
									
										
										
										
											2009-12-14 17:59:48 -08:00
										 |  |  | } | 
					
						
							|  |  |  | 
 | 
					
						
							| 
									
										
										
										
											2008-10-18 20:26:32 -07:00
										 |  |  | static unsigned long shrink_list(enum lru_list lru, unsigned long nr_to_scan, | 
					
						
							| 
									
										
										
										
											2012-05-29 15:07:01 -07:00
										 |  |  | 				 struct lruvec *lruvec, struct scan_control *sc) | 
					
						
							| 
									
										
										
										
											2008-10-18 20:26:14 -07:00
										 |  |  | { | 
					
						
							| 
									
										
										
										
											2009-12-14 17:59:48 -08:00
										 |  |  | 	if (is_active_lru(lru)) { | 
					
						
							| 
									
										
										
										
											2012-05-29 15:07:09 -07:00
										 |  |  | 		if (inactive_list_is_low(lruvec, lru)) | 
					
						
							| 
									
										
										
										
											2012-05-29 15:07:01 -07:00
										 |  |  | 			shrink_active_list(nr_to_scan, lruvec, sc, lru); | 
					
						
							| 
									
										
										
										
											2008-10-18 20:26:34 -07:00
										 |  |  | 		return 0; | 
					
						
							|  |  |  | 	} | 
					
						
							|  |  |  | 
 | 
					
						
							| 
									
										
										
										
											2012-05-29 15:07:01 -07:00
										 |  |  | 	return shrink_inactive_list(nr_to_scan, lruvec, sc, lru); | 
					
						
							| 
									
										
										
										
											2008-10-18 20:26:32 -07:00
										 |  |  | } | 
					
						
							|  |  |  | 
 | 
					
						
							| 
									
										
										
										
											2012-05-29 15:06:57 -07:00
										 |  |  | static int vmscan_swappiness(struct scan_control *sc) | 
					
						
							| 
									
										
										
										
											2011-07-26 16:08:21 -07:00
										 |  |  | { | 
					
						
							| 
									
										
										
										
											2012-01-12 17:17:50 -08:00
										 |  |  | 	if (global_reclaim(sc)) | 
					
						
							| 
									
										
										
										
											2011-07-26 16:08:21 -07:00
										 |  |  | 		return vm_swappiness; | 
					
						
							| 
									
										
										
										
											2012-05-29 15:06:57 -07:00
										 |  |  | 	return mem_cgroup_swappiness(sc->target_mem_cgroup); | 
					
						
							| 
									
										
										
										
											2011-07-26 16:08:21 -07:00
										 |  |  | } | 
					
						
							|  |  |  | 
 | 
					
						
							| 
									
										
										
										
											2013-02-22 16:32:17 -08:00
										 |  |  | enum scan_balance { | 
					
						
							|  |  |  | 	SCAN_EQUAL, | 
					
						
							|  |  |  | 	SCAN_FRACT, | 
					
						
							|  |  |  | 	SCAN_ANON, | 
					
						
							|  |  |  | 	SCAN_FILE, | 
					
						
							|  |  |  | }; | 
					
						
							|  |  |  | 
 | 
					
						
							| 
									
										
										
										
											2008-10-18 20:26:32 -07:00
										 |  |  | /*
 | 
					
						
							|  |  |  |  * Determine how aggressively the anon and file LRU lists should be | 
					
						
							|  |  |  |  * scanned.  The relative value of each set of LRU lists is determined | 
					
						
							|  |  |  |  * by looking at the fraction of the pages scanned we did rotate back | 
					
						
							|  |  |  |  * onto the active list instead of evict. | 
					
						
							|  |  |  |  * | 
					
						
							| 
									
										
										
										
											2012-06-14 20:41:02 +08:00
										 |  |  |  * nr[0] = anon inactive pages to scan; nr[1] = anon active pages to scan | 
					
						
							|  |  |  |  * nr[2] = file inactive pages to scan; nr[3] = file active pages to scan | 
					
						
							| 
									
										
										
										
											2008-10-18 20:26:32 -07:00
										 |  |  |  */ | 
					
						
							| 
									
										
										
										
											2012-05-29 15:07:01 -07:00
										 |  |  | static void get_scan_count(struct lruvec *lruvec, struct scan_control *sc, | 
					
						
							| 
									
										
										
										
											2012-05-29 15:06:57 -07:00
										 |  |  | 			   unsigned long *nr) | 
					
						
							| 
									
										
										
										
											2008-10-18 20:26:32 -07:00
										 |  |  | { | 
					
						
							| 
									
										
										
										
											2013-02-22 16:32:17 -08:00
										 |  |  | 	struct zone_reclaim_stat *reclaim_stat = &lruvec->reclaim_stat; | 
					
						
							|  |  |  | 	u64 fraction[2]; | 
					
						
							|  |  |  | 	u64 denominator = 0;	/* gcc */ | 
					
						
							|  |  |  | 	struct zone *zone = lruvec_zone(lruvec); | 
					
						
							| 
									
										
										
										
											2008-10-18 20:26:32 -07:00
										 |  |  | 	unsigned long anon_prio, file_prio; | 
					
						
							| 
									
										
										
										
											2013-02-22 16:32:17 -08:00
										 |  |  | 	enum scan_balance scan_balance; | 
					
						
							|  |  |  | 	unsigned long anon, file, free; | 
					
						
							|  |  |  | 	bool force_scan = false; | 
					
						
							| 
									
										
										
										
											2008-10-18 20:26:32 -07:00
										 |  |  | 	unsigned long ap, fp; | 
					
						
							| 
									
										
										
										
											2012-01-12 17:20:01 -08:00
										 |  |  | 	enum lru_list lru; | 
					
						
							| 
									
										
											  
											
												memcg: fix get_scan_count() for small targets
During memory reclaim we determine the number of pages to be scanned per
zone as
	(anon + file) >> priority.
Assume
	scan = (anon + file) >> priority.
If scan < SWAP_CLUSTER_MAX, the scan will be skipped for this time and
priority gets higher.  This has some problems.
  1. This increases priority as 1 without any scan.
     To do scan in this priority, amount of pages should be larger than 512M.
     If pages>>priority < SWAP_CLUSTER_MAX, it's recorded and scan will be
     batched, later. (But we lose 1 priority.)
     If memory size is below 16M, pages >> priority is 0 and no scan in
     DEF_PRIORITY forever.
  2. If zone->all_unreclaimabe==true, it's scanned only when priority==0.
     So, x86's ZONE_DMA will never be recoverred until the user of pages
     frees memory by itself.
  3. With memcg, the limit of memory can be small. When using small memcg,
     it gets priority < DEF_PRIORITY-2 very easily and need to call
     wait_iff_congested().
     For doing scan before priorty=9, 64MB of memory should be used.
Then, this patch tries to scan SWAP_CLUSTER_MAX of pages in force...when
  1. the target is enough small.
  2. it's kswapd or memcg reclaim.
Then we can avoid rapid priority drop and may be able to recover
all_unreclaimable in a small zones.  And this patch removes nr_saved_scan.
 This will allow scanning in this priority even when pages >> priority is
very small.
Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Acked-by: Ying Han <yinghan@google.com>
Cc: Balbir Singh <balbir@in.ibm.com>
Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Cc: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp>
Cc: Mel Gorman <mel@csn.ul.ie>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
											
										 
											2011-05-26 16:25:34 -07:00
										 |  |  | 
 | 
					
						
							| 
									
										
										
										
											2011-10-31 17:07:27 -07:00
										 |  |  | 	/*
 | 
					
						
							|  |  |  | 	 * If the zone or memcg is small, nr[l] can be 0.  This | 
					
						
							|  |  |  | 	 * results in no scanning on this priority and a potential | 
					
						
							|  |  |  | 	 * priority drop.  Global direct reclaim can go to the next | 
					
						
							|  |  |  | 	 * zone and tends to have no problems. Global kswapd is for | 
					
						
							|  |  |  | 	 * zone balancing and it needs to scan a minimum amount. When | 
					
						
							|  |  |  | 	 * reclaiming for a memcg, a priority drop can cause high | 
					
						
							|  |  |  | 	 * latencies, so it's better to scan a minimum amount there as | 
					
						
							|  |  |  | 	 * well. | 
					
						
							|  |  |  | 	 */ | 
					
						
							| 
									
										
										
										
											2013-09-11 14:22:36 -07:00
										 |  |  | 	if (current_is_kswapd() && !zone_reclaimable(zone)) | 
					
						
							| 
									
										
										
										
											2011-09-14 16:21:52 -07:00
										 |  |  | 		force_scan = true; | 
					
						
							| 
									
										
										
										
											2012-01-12 17:17:50 -08:00
										 |  |  | 	if (!global_reclaim(sc)) | 
					
						
							| 
									
										
										
										
											2011-09-14 16:21:52 -07:00
										 |  |  | 		force_scan = true; | 
					
						
							| 
									
										
											  
											
												vmscan: prevent get_scan_ratio() rounding errors
get_scan_ratio() calculates percentage and if the percentage is < 1%, it
will round percentage down to 0% and cause we completely ignore scanning
anon/file pages to reclaim memory even the total anon/file pages are very
big.
To avoid underflow, we don't use percentage, instead we directly calculate
how many pages should be scaned.  In this way, we should get several
scanned pages for < 1% percent.
This has some benefits:
1. increase our calculation precision
2.  making our scan more smoothly.  Without this, if percent[x] is
   underflow, shrink_zone() doesn't scan any pages and suddenly it scans
   all pages when priority is zero.  With this, even priority isn't zero,
   shrink_zone() gets chance to scan some pages.
Note, this patch doesn't really change logics, but just increase
precision.  For system with a lot of memory, this might slightly changes
behavior.  For example, in a sequential file read workload, without the
patch, we don't swap any anon pages.  With it, if anon memory size is
bigger than 16G, we will see one anon page swapped.  The 16G is calculated
as PAGE_SIZE * priority(4096) * (fp/ap).  fp/ap is assumed to be 1024
which is common in this workload.  So the impact sounds not a big deal.
Signed-off-by: Shaohua Li <shaohua.li@intel.com>
Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Acked-by: Rik van Riel <riel@redhat.com>
Cc: Wu Fengguang <fengguang.wu@intel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
											
										 
											2010-05-24 14:32:36 -07:00
										 |  |  | 
 | 
					
						
							|  |  |  | 	/* If we have no swap space, do not bother scanning anon pages. */ | 
					
						
							| 
									
										
											  
											
												swap: add per-partition lock for swapfile
swap_lock is heavily contended when I test swap to 3 fast SSD (even
slightly slower than swap to 2 such SSD).  The main contention comes
from swap_info_get().  This patch tries to fix the gap with adding a new
per-partition lock.
Global data like nr_swapfiles, total_swap_pages, least_priority and
swap_list are still protected by swap_lock.
nr_swap_pages is an atomic now, it can be changed without swap_lock.  In
theory, it's possible get_swap_page() finds no swap pages but actually
there are free swap pages.  But sounds not a big problem.
Accessing partition specific data (like scan_swap_map and so on) is only
protected by swap_info_struct.lock.
Changing swap_info_struct.flags need hold swap_lock and
swap_info_struct.lock, because scan_scan_map() will check it.  read the
flags is ok with either the locks hold.
If both swap_lock and swap_info_struct.lock must be hold, we always hold
the former first to avoid deadlock.
swap_entry_free() can change swap_list.  To delete that code, we add a
new highest_priority_index.  Whenever get_swap_page() is called, we
check it.  If it's valid, we use it.
It's a pity get_swap_page() still holds swap_lock().  But in practice,
swap_lock() isn't heavily contended in my test with this patch (or I can
say there are other much more heavier bottlenecks like TLB flush).  And
BTW, looks get_swap_page() doesn't really need the lock.  We never free
swap_info[] and we check SWAP_WRITEOK flag.  The only risk without the
lock is we could swapout to some low priority swap, but we can quickly
recover after several rounds of swap, so sounds not a big deal to me.
But I'd prefer to fix this if it's a real problem.
"swap: make each swap partition have one address_space" improved the
swapout speed from 1.7G/s to 2G/s.  This patch further improves the
speed to 2.3G/s, so around 15% improvement.  It's a multi-process test,
so TLB flush isn't the biggest bottleneck before the patches.
[arnd@arndb.de: fix it for nommu]
[hughd@google.com: add missing unlock]
[minchan@kernel.org: get rid of lockdep whinge on sys_swapon]
Signed-off-by: Shaohua Li <shli@fusionio.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: Rik van Riel <riel@redhat.com>
Cc: Minchan Kim <minchan.kim@gmail.com>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: Seth Jennings <sjenning@linux.vnet.ibm.com>
Cc: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Cc: Xiao Guangrong <xiaoguangrong@linux.vnet.ibm.com>
Cc: Dan Magenheimer <dan.magenheimer@oracle.com>
Cc: Stephen Rothwell <sfr@canb.auug.org.au>
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Signed-off-by: Hugh Dickins <hughd@google.com>
Signed-off-by: Minchan Kim <minchan@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
											
										 
											2013-02-22 16:34:38 -08:00
										 |  |  | 	if (!sc->may_swap || (get_nr_swap_pages() <= 0)) { | 
					
						
							| 
									
										
										
										
											2013-02-22 16:32:17 -08:00
										 |  |  | 		scan_balance = SCAN_FILE; | 
					
						
							| 
									
										
											  
											
												vmscan: prevent get_scan_ratio() rounding errors
get_scan_ratio() calculates percentage and if the percentage is < 1%, it
will round percentage down to 0% and cause we completely ignore scanning
anon/file pages to reclaim memory even the total anon/file pages are very
big.
To avoid underflow, we don't use percentage, instead we directly calculate
how many pages should be scaned.  In this way, we should get several
scanned pages for < 1% percent.
This has some benefits:
1. increase our calculation precision
2.  making our scan more smoothly.  Without this, if percent[x] is
   underflow, shrink_zone() doesn't scan any pages and suddenly it scans
   all pages when priority is zero.  With this, even priority isn't zero,
   shrink_zone() gets chance to scan some pages.
Note, this patch doesn't really change logics, but just increase
precision.  For system with a lot of memory, this might slightly changes
behavior.  For example, in a sequential file read workload, without the
patch, we don't swap any anon pages.  With it, if anon memory size is
bigger than 16G, we will see one anon page swapped.  The 16G is calculated
as PAGE_SIZE * priority(4096) * (fp/ap).  fp/ap is assumed to be 1024
which is common in this workload.  So the impact sounds not a big deal.
Signed-off-by: Shaohua Li <shaohua.li@intel.com>
Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Acked-by: Rik van Riel <riel@redhat.com>
Cc: Wu Fengguang <fengguang.wu@intel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
											
										 
											2010-05-24 14:32:36 -07:00
										 |  |  | 		goto out; | 
					
						
							|  |  |  | 	} | 
					
						
							| 
									
										
										
										
											2008-10-18 20:26:32 -07:00
										 |  |  | 
 | 
					
						
							| 
									
										
										
										
											2013-02-22 16:32:14 -08:00
										 |  |  | 	/*
 | 
					
						
							|  |  |  | 	 * Global reclaim will swap to prevent OOM even with no | 
					
						
							|  |  |  | 	 * swappiness, but memcg users want to use this knob to | 
					
						
							|  |  |  | 	 * disable swapping for individual groups completely when | 
					
						
							|  |  |  | 	 * using the memory controller's swap limit feature would be | 
					
						
							|  |  |  | 	 * too expensive. | 
					
						
							|  |  |  | 	 */ | 
					
						
							|  |  |  | 	if (!global_reclaim(sc) && !vmscan_swappiness(sc)) { | 
					
						
							| 
									
										
										
										
											2013-02-22 16:32:17 -08:00
										 |  |  | 		scan_balance = SCAN_FILE; | 
					
						
							| 
									
										
										
										
											2013-02-22 16:32:14 -08:00
										 |  |  | 		goto out; | 
					
						
							|  |  |  | 	} | 
					
						
							|  |  |  | 
 | 
					
						
							|  |  |  | 	/*
 | 
					
						
							|  |  |  | 	 * Do not apply any pressure balancing cleverness when the | 
					
						
							|  |  |  | 	 * system is close to OOM, scan both anon and file equally | 
					
						
							|  |  |  | 	 * (unless the swappiness setting disagrees with swapping). | 
					
						
							|  |  |  | 	 */ | 
					
						
							|  |  |  | 	if (!sc->priority && vmscan_swappiness(sc)) { | 
					
						
							| 
									
										
										
										
											2013-02-22 16:32:17 -08:00
										 |  |  | 		scan_balance = SCAN_EQUAL; | 
					
						
							| 
									
										
										
										
											2013-02-22 16:32:14 -08:00
										 |  |  | 		goto out; | 
					
						
							|  |  |  | 	} | 
					
						
							|  |  |  | 
 | 
					
						
							| 
									
										
										
										
											2012-05-29 15:07:08 -07:00
										 |  |  | 	anon  = get_lru_size(lruvec, LRU_ACTIVE_ANON) + | 
					
						
							|  |  |  | 		get_lru_size(lruvec, LRU_INACTIVE_ANON); | 
					
						
							|  |  |  | 	file  = get_lru_size(lruvec, LRU_ACTIVE_FILE) + | 
					
						
							|  |  |  | 		get_lru_size(lruvec, LRU_INACTIVE_FILE); | 
					
						
							| 
									
										
										
										
											2011-09-14 16:21:52 -07:00
										 |  |  | 
 | 
					
						
							| 
									
										
										
										
											2013-02-22 16:32:15 -08:00
										 |  |  | 	/*
 | 
					
						
							|  |  |  | 	 * If it's foreseeable that reclaiming the file cache won't be | 
					
						
							|  |  |  | 	 * enough to get the zone back into a desirable shape, we have | 
					
						
							|  |  |  | 	 * to swap.  Better start now and leave the - probably heavily | 
					
						
							|  |  |  | 	 * thrashing - remaining file pages alone. | 
					
						
							|  |  |  | 	 */ | 
					
						
							| 
									
										
										
										
											2012-01-12 17:17:50 -08:00
										 |  |  | 	if (global_reclaim(sc)) { | 
					
						
							| 
									
										
										
										
											2013-02-22 16:32:15 -08:00
										 |  |  | 		free = zone_page_state(zone, NR_FREE_PAGES); | 
					
						
							| 
									
										
										
										
											2012-05-29 15:07:01 -07:00
										 |  |  | 		if (unlikely(file + free <= high_wmark_pages(zone))) { | 
					
						
							| 
									
										
										
										
											2013-02-22 16:32:17 -08:00
										 |  |  | 			scan_balance = SCAN_ANON; | 
					
						
							| 
									
										
											  
											
												vmscan: prevent get_scan_ratio() rounding errors
get_scan_ratio() calculates percentage and if the percentage is < 1%, it
will round percentage down to 0% and cause we completely ignore scanning
anon/file pages to reclaim memory even the total anon/file pages are very
big.
To avoid underflow, we don't use percentage, instead we directly calculate
how many pages should be scaned.  In this way, we should get several
scanned pages for < 1% percent.
This has some benefits:
1. increase our calculation precision
2.  making our scan more smoothly.  Without this, if percent[x] is
   underflow, shrink_zone() doesn't scan any pages and suddenly it scans
   all pages when priority is zero.  With this, even priority isn't zero,
   shrink_zone() gets chance to scan some pages.
Note, this patch doesn't really change logics, but just increase
precision.  For system with a lot of memory, this might slightly changes
behavior.  For example, in a sequential file read workload, without the
patch, we don't swap any anon pages.  With it, if anon memory size is
bigger than 16G, we will see one anon page swapped.  The 16G is calculated
as PAGE_SIZE * priority(4096) * (fp/ap).  fp/ap is assumed to be 1024
which is common in this workload.  So the impact sounds not a big deal.
Signed-off-by: Shaohua Li <shaohua.li@intel.com>
Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Acked-by: Rik van Riel <riel@redhat.com>
Cc: Wu Fengguang <fengguang.wu@intel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
											
										 
											2010-05-24 14:32:36 -07:00
										 |  |  | 			goto out; | 
					
						
							| 
									
										
										
										
											2009-01-07 18:08:17 -08:00
										 |  |  | 		} | 
					
						
							| 
									
										
										
										
											2008-10-18 20:26:32 -07:00
										 |  |  | 	} | 
					
						
							|  |  |  | 
 | 
					
						
							| 
									
										
										
										
											2013-02-22 16:32:10 -08:00
										 |  |  | 	/*
 | 
					
						
							|  |  |  | 	 * There is enough inactive page cache, do not reclaim | 
					
						
							|  |  |  | 	 * anything from the anonymous working set right now. | 
					
						
							|  |  |  | 	 */ | 
					
						
							|  |  |  | 	if (!inactive_file_is_low(lruvec)) { | 
					
						
							| 
									
										
										
										
											2013-02-22 16:32:17 -08:00
										 |  |  | 		scan_balance = SCAN_FILE; | 
					
						
							| 
									
										
										
										
											2013-02-22 16:32:10 -08:00
										 |  |  | 		goto out; | 
					
						
							|  |  |  | 	} | 
					
						
							|  |  |  | 
 | 
					
						
							| 
									
										
										
										
											2013-02-22 16:32:17 -08:00
										 |  |  | 	scan_balance = SCAN_FRACT; | 
					
						
							|  |  |  | 
 | 
					
						
							| 
									
										
										
										
											2010-08-09 17:19:51 -07:00
										 |  |  | 	/*
 | 
					
						
							|  |  |  | 	 * With swappiness at 100, anonymous and file have the same priority. | 
					
						
							|  |  |  | 	 * This scanning priority is essentially the inverse of IO cost. | 
					
						
							|  |  |  | 	 */ | 
					
						
							| 
									
										
										
										
											2012-05-29 15:06:57 -07:00
										 |  |  | 	anon_prio = vmscan_swappiness(sc); | 
					
						
							| 
									
										
										
										
											2012-05-29 15:07:09 -07:00
										 |  |  | 	file_prio = 200 - anon_prio; | 
					
						
							| 
									
										
										
										
											2010-08-09 17:19:51 -07:00
										 |  |  | 
 | 
					
						
							| 
									
										
										
										
											2008-10-18 20:26:32 -07:00
										 |  |  | 	/*
 | 
					
						
							|  |  |  | 	 * OK, so we have swap space and a fair amount of page cache | 
					
						
							|  |  |  | 	 * pages.  We use the recently rotated / recently scanned | 
					
						
							|  |  |  | 	 * ratios to determine how valuable each cache is. | 
					
						
							|  |  |  | 	 * | 
					
						
							|  |  |  | 	 * Because workloads change over time (and to avoid overflow) | 
					
						
							|  |  |  | 	 * we keep these statistics as a floating average, which ends | 
					
						
							|  |  |  | 	 * up weighing recent references more than old ones. | 
					
						
							|  |  |  | 	 * | 
					
						
							|  |  |  | 	 * anon in [0], file in [1] | 
					
						
							|  |  |  | 	 */ | 
					
						
							| 
									
										
										
										
											2012-05-29 15:07:01 -07:00
										 |  |  | 	spin_lock_irq(&zone->lru_lock); | 
					
						
							| 
									
										
										
										
											2009-01-07 18:08:15 -08:00
										 |  |  | 	if (unlikely(reclaim_stat->recent_scanned[0] > anon / 4)) { | 
					
						
							|  |  |  | 		reclaim_stat->recent_scanned[0] /= 2; | 
					
						
							|  |  |  | 		reclaim_stat->recent_rotated[0] /= 2; | 
					
						
							| 
									
										
										
										
											2008-10-18 20:26:32 -07:00
										 |  |  | 	} | 
					
						
							|  |  |  | 
 | 
					
						
							| 
									
										
										
										
											2009-01-07 18:08:15 -08:00
										 |  |  | 	if (unlikely(reclaim_stat->recent_scanned[1] > file / 4)) { | 
					
						
							|  |  |  | 		reclaim_stat->recent_scanned[1] /= 2; | 
					
						
							|  |  |  | 		reclaim_stat->recent_rotated[1] /= 2; | 
					
						
							| 
									
										
										
										
											2008-10-18 20:26:32 -07:00
										 |  |  | 	} | 
					
						
							|  |  |  | 
 | 
					
						
							|  |  |  | 	/*
 | 
					
						
							| 
									
										
										
										
											2008-11-19 15:36:44 -08:00
										 |  |  | 	 * The amount of pressure on anon vs file pages is inversely | 
					
						
							|  |  |  | 	 * proportional to the fraction of recently scanned pages on | 
					
						
							|  |  |  | 	 * each list that were recently referenced and in active use. | 
					
						
							| 
									
										
										
										
											2008-10-18 20:26:32 -07:00
										 |  |  | 	 */ | 
					
						
							| 
									
										
										
										
											2012-05-29 15:06:47 -07:00
										 |  |  | 	ap = anon_prio * (reclaim_stat->recent_scanned[0] + 1); | 
					
						
							| 
									
										
										
										
											2009-01-07 18:08:15 -08:00
										 |  |  | 	ap /= reclaim_stat->recent_rotated[0] + 1; | 
					
						
							| 
									
										
										
										
											2008-10-18 20:26:32 -07:00
										 |  |  | 
 | 
					
						
							| 
									
										
										
										
											2012-05-29 15:06:47 -07:00
										 |  |  | 	fp = file_prio * (reclaim_stat->recent_scanned[1] + 1); | 
					
						
							| 
									
										
										
										
											2009-01-07 18:08:15 -08:00
										 |  |  | 	fp /= reclaim_stat->recent_rotated[1] + 1; | 
					
						
							| 
									
										
										
										
											2012-05-29 15:07:01 -07:00
										 |  |  | 	spin_unlock_irq(&zone->lru_lock); | 
					
						
							| 
									
										
										
										
											2008-10-18 20:26:32 -07:00
										 |  |  | 
 | 
					
						
							| 
									
										
											  
											
												vmscan: prevent get_scan_ratio() rounding errors
get_scan_ratio() calculates percentage and if the percentage is < 1%, it
will round percentage down to 0% and cause we completely ignore scanning
anon/file pages to reclaim memory even the total anon/file pages are very
big.
To avoid underflow, we don't use percentage, instead we directly calculate
how many pages should be scaned.  In this way, we should get several
scanned pages for < 1% percent.
This has some benefits:
1. increase our calculation precision
2.  making our scan more smoothly.  Without this, if percent[x] is
   underflow, shrink_zone() doesn't scan any pages and suddenly it scans
   all pages when priority is zero.  With this, even priority isn't zero,
   shrink_zone() gets chance to scan some pages.
Note, this patch doesn't really change logics, but just increase
precision.  For system with a lot of memory, this might slightly changes
behavior.  For example, in a sequential file read workload, without the
patch, we don't swap any anon pages.  With it, if anon memory size is
bigger than 16G, we will see one anon page swapped.  The 16G is calculated
as PAGE_SIZE * priority(4096) * (fp/ap).  fp/ap is assumed to be 1024
which is common in this workload.  So the impact sounds not a big deal.
Signed-off-by: Shaohua Li <shaohua.li@intel.com>
Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Acked-by: Rik van Riel <riel@redhat.com>
Cc: Wu Fengguang <fengguang.wu@intel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
											
										 
											2010-05-24 14:32:36 -07:00
										 |  |  | 	fraction[0] = ap; | 
					
						
							|  |  |  | 	fraction[1] = fp; | 
					
						
							|  |  |  | 	denominator = ap + fp + 1; | 
					
						
							|  |  |  | out: | 
					
						
							| 
									
										
										
										
											2012-01-12 17:20:01 -08:00
										 |  |  | 	for_each_evictable_lru(lru) { | 
					
						
							|  |  |  | 		int file = is_file_lru(lru); | 
					
						
							| 
									
										
										
										
											2013-02-22 16:32:12 -08:00
										 |  |  | 		unsigned long size; | 
					
						
							| 
									
										
											  
											
												vmscan: prevent get_scan_ratio() rounding errors
get_scan_ratio() calculates percentage and if the percentage is < 1%, it
will round percentage down to 0% and cause we completely ignore scanning
anon/file pages to reclaim memory even the total anon/file pages are very
big.
To avoid underflow, we don't use percentage, instead we directly calculate
how many pages should be scaned.  In this way, we should get several
scanned pages for < 1% percent.
This has some benefits:
1. increase our calculation precision
2.  making our scan more smoothly.  Without this, if percent[x] is
   underflow, shrink_zone() doesn't scan any pages and suddenly it scans
   all pages when priority is zero.  With this, even priority isn't zero,
   shrink_zone() gets chance to scan some pages.
Note, this patch doesn't really change logics, but just increase
precision.  For system with a lot of memory, this might slightly changes
behavior.  For example, in a sequential file read workload, without the
patch, we don't swap any anon pages.  With it, if anon memory size is
bigger than 16G, we will see one anon page swapped.  The 16G is calculated
as PAGE_SIZE * priority(4096) * (fp/ap).  fp/ap is assumed to be 1024
which is common in this workload.  So the impact sounds not a big deal.
Signed-off-by: Shaohua Li <shaohua.li@intel.com>
Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Acked-by: Rik van Riel <riel@redhat.com>
Cc: Wu Fengguang <fengguang.wu@intel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
											
										 
											2010-05-24 14:32:36 -07:00
										 |  |  | 		unsigned long scan; | 
					
						
							| 
									
										
										
										
											2009-06-16 15:32:29 -07:00
										 |  |  | 
 | 
					
						
							| 
									
										
										
										
											2013-02-22 16:32:12 -08:00
										 |  |  | 		size = get_lru_size(lruvec, lru); | 
					
						
							| 
									
										
										
										
											2013-02-22 16:32:14 -08:00
										 |  |  | 		scan = size >> sc->priority; | 
					
						
							| 
									
										
										
										
											2013-02-22 16:32:17 -08:00
										 |  |  | 
 | 
					
						
							| 
									
										
										
										
											2013-02-22 16:32:14 -08:00
										 |  |  | 		if (!scan && force_scan) | 
					
						
							|  |  |  | 			scan = min(size, SWAP_CLUSTER_MAX); | 
					
						
							| 
									
										
										
										
											2013-02-22 16:32:17 -08:00
										 |  |  | 
 | 
					
						
							|  |  |  | 		switch (scan_balance) { | 
					
						
							|  |  |  | 		case SCAN_EQUAL: | 
					
						
							|  |  |  | 			/* Scan lists relative to size */ | 
					
						
							|  |  |  | 			break; | 
					
						
							|  |  |  | 		case SCAN_FRACT: | 
					
						
							|  |  |  | 			/*
 | 
					
						
							|  |  |  | 			 * Scan types proportional to swappiness and | 
					
						
							|  |  |  | 			 * their relative recent reclaim efficiency. | 
					
						
							|  |  |  | 			 */ | 
					
						
							|  |  |  | 			scan = div64_u64(scan * fraction[file], denominator); | 
					
						
							|  |  |  | 			break; | 
					
						
							|  |  |  | 		case SCAN_FILE: | 
					
						
							|  |  |  | 		case SCAN_ANON: | 
					
						
							|  |  |  | 			/* Scan one type exclusively */ | 
					
						
							|  |  |  | 			if ((scan_balance == SCAN_FILE) != file) | 
					
						
							|  |  |  | 				scan = 0; | 
					
						
							|  |  |  | 			break; | 
					
						
							|  |  |  | 		default: | 
					
						
							|  |  |  | 			/* Look ma, no brain */ | 
					
						
							|  |  |  | 			BUG(); | 
					
						
							|  |  |  | 		} | 
					
						
							| 
									
										
										
										
											2012-01-12 17:20:01 -08:00
										 |  |  | 		nr[lru] = scan; | 
					
						
							| 
									
										
											  
											
												vmscan: prevent get_scan_ratio() rounding errors
get_scan_ratio() calculates percentage and if the percentage is < 1%, it
will round percentage down to 0% and cause we completely ignore scanning
anon/file pages to reclaim memory even the total anon/file pages are very
big.
To avoid underflow, we don't use percentage, instead we directly calculate
how many pages should be scaned.  In this way, we should get several
scanned pages for < 1% percent.
This has some benefits:
1. increase our calculation precision
2.  making our scan more smoothly.  Without this, if percent[x] is
   underflow, shrink_zone() doesn't scan any pages and suddenly it scans
   all pages when priority is zero.  With this, even priority isn't zero,
   shrink_zone() gets chance to scan some pages.
Note, this patch doesn't really change logics, but just increase
precision.  For system with a lot of memory, this might slightly changes
behavior.  For example, in a sequential file read workload, without the
patch, we don't swap any anon pages.  With it, if anon memory size is
bigger than 16G, we will see one anon page swapped.  The 16G is calculated
as PAGE_SIZE * priority(4096) * (fp/ap).  fp/ap is assumed to be 1024
which is common in this workload.  So the impact sounds not a big deal.
Signed-off-by: Shaohua Li <shaohua.li@intel.com>
Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Acked-by: Rik van Riel <riel@redhat.com>
Cc: Wu Fengguang <fengguang.wu@intel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
											
										 
											2010-05-24 14:32:36 -07:00
										 |  |  | 	} | 
					
						
							| 
									
										
										
										
											2009-06-16 15:32:29 -07:00
										 |  |  | } | 
					
						
							| 
									
										
										
										
											2008-10-18 20:26:32 -07:00
										 |  |  | 
 | 
					
						
							| 
									
										
										
										
											2013-02-22 16:32:19 -08:00
										 |  |  | /*
 | 
					
						
							|  |  |  |  * This is a basic per-zone page freer.  Used by both kswapd and direct reclaim. | 
					
						
							|  |  |  |  */ | 
					
						
							|  |  |  | static void shrink_lruvec(struct lruvec *lruvec, struct scan_control *sc) | 
					
						
							|  |  |  | { | 
					
						
							|  |  |  | 	unsigned long nr[NR_LRU_LISTS]; | 
					
						
							| 
									
										
										
										
											2013-07-03 15:01:44 -07:00
										 |  |  | 	unsigned long targets[NR_LRU_LISTS]; | 
					
						
							| 
									
										
										
										
											2013-02-22 16:32:19 -08:00
										 |  |  | 	unsigned long nr_to_scan; | 
					
						
							|  |  |  | 	enum lru_list lru; | 
					
						
							|  |  |  | 	unsigned long nr_reclaimed = 0; | 
					
						
							|  |  |  | 	unsigned long nr_to_reclaim = sc->nr_to_reclaim; | 
					
						
							|  |  |  | 	struct blk_plug plug; | 
					
						
							| 
									
										
										
										
											2013-07-03 15:01:44 -07:00
										 |  |  | 	bool scan_adjusted = false; | 
					
						
							| 
									
										
										
										
											2013-02-22 16:32:19 -08:00
										 |  |  | 
 | 
					
						
							|  |  |  | 	get_scan_count(lruvec, sc, nr); | 
					
						
							|  |  |  | 
 | 
					
						
							| 
									
										
										
										
											2013-07-03 15:01:44 -07:00
										 |  |  | 	/* Record the original scan target for proportional adjustments later */ | 
					
						
							|  |  |  | 	memcpy(targets, nr, sizeof(nr)); | 
					
						
							|  |  |  | 
 | 
					
						
							| 
									
										
										
										
											2013-02-22 16:32:19 -08:00
										 |  |  | 	blk_start_plug(&plug); | 
					
						
							|  |  |  | 	while (nr[LRU_INACTIVE_ANON] || nr[LRU_ACTIVE_FILE] || | 
					
						
							|  |  |  | 					nr[LRU_INACTIVE_FILE]) { | 
					
						
							| 
									
										
										
										
											2013-07-03 15:01:44 -07:00
										 |  |  | 		unsigned long nr_anon, nr_file, percentage; | 
					
						
							|  |  |  | 		unsigned long nr_scanned; | 
					
						
							|  |  |  | 
 | 
					
						
							| 
									
										
										
										
											2013-02-22 16:32:19 -08:00
										 |  |  | 		for_each_evictable_lru(lru) { | 
					
						
							|  |  |  | 			if (nr[lru]) { | 
					
						
							|  |  |  | 				nr_to_scan = min(nr[lru], SWAP_CLUSTER_MAX); | 
					
						
							|  |  |  | 				nr[lru] -= nr_to_scan; | 
					
						
							|  |  |  | 
 | 
					
						
							|  |  |  | 				nr_reclaimed += shrink_list(lru, nr_to_scan, | 
					
						
							|  |  |  | 							    lruvec, sc); | 
					
						
							|  |  |  | 			} | 
					
						
							|  |  |  | 		} | 
					
						
							| 
									
										
										
										
											2013-07-03 15:01:44 -07:00
										 |  |  | 
 | 
					
						
							|  |  |  | 		if (nr_reclaimed < nr_to_reclaim || scan_adjusted) | 
					
						
							|  |  |  | 			continue; | 
					
						
							|  |  |  | 
 | 
					
						
							| 
									
										
										
										
											2013-02-22 16:32:19 -08:00
										 |  |  | 		/*
 | 
					
						
							| 
									
										
										
										
											2013-07-03 15:01:44 -07:00
										 |  |  | 		 * For global direct reclaim, reclaim only the number of pages | 
					
						
							|  |  |  | 		 * requested. Less care is taken to scan proportionally as it | 
					
						
							|  |  |  | 		 * is more important to minimise direct reclaim stall latency | 
					
						
							|  |  |  | 		 * than it is to properly age the LRU lists. | 
					
						
							| 
									
										
										
										
											2013-02-22 16:32:19 -08:00
										 |  |  | 		 */ | 
					
						
							| 
									
										
										
										
											2013-07-03 15:01:44 -07:00
										 |  |  | 		if (global_reclaim(sc) && !current_is_kswapd()) | 
					
						
							| 
									
										
										
										
											2013-02-22 16:32:19 -08:00
										 |  |  | 			break; | 
					
						
							| 
									
										
										
										
											2013-07-03 15:01:44 -07:00
										 |  |  | 
 | 
					
						
							|  |  |  | 		/*
 | 
					
						
							|  |  |  | 		 * For kswapd and memcg, reclaim at least the number of pages | 
					
						
							|  |  |  | 		 * requested. Ensure that the anon and file LRUs shrink | 
					
						
							|  |  |  | 		 * proportionally what was requested by get_scan_count(). We | 
					
						
							|  |  |  | 		 * stop reclaiming one LRU and reduce the amount scanning | 
					
						
							|  |  |  | 		 * proportional to the original scan target. | 
					
						
							|  |  |  | 		 */ | 
					
						
							|  |  |  | 		nr_file = nr[LRU_INACTIVE_FILE] + nr[LRU_ACTIVE_FILE]; | 
					
						
							|  |  |  | 		nr_anon = nr[LRU_INACTIVE_ANON] + nr[LRU_ACTIVE_ANON]; | 
					
						
							|  |  |  | 
 | 
					
						
							|  |  |  | 		if (nr_file > nr_anon) { | 
					
						
							|  |  |  | 			unsigned long scan_target = targets[LRU_INACTIVE_ANON] + | 
					
						
							|  |  |  | 						targets[LRU_ACTIVE_ANON] + 1; | 
					
						
							|  |  |  | 			lru = LRU_BASE; | 
					
						
							|  |  |  | 			percentage = nr_anon * 100 / scan_target; | 
					
						
							|  |  |  | 		} else { | 
					
						
							|  |  |  | 			unsigned long scan_target = targets[LRU_INACTIVE_FILE] + | 
					
						
							|  |  |  | 						targets[LRU_ACTIVE_FILE] + 1; | 
					
						
							|  |  |  | 			lru = LRU_FILE; | 
					
						
							|  |  |  | 			percentage = nr_file * 100 / scan_target; | 
					
						
							|  |  |  | 		} | 
					
						
							|  |  |  | 
 | 
					
						
							|  |  |  | 		/* Stop scanning the smaller of the LRU */ | 
					
						
							|  |  |  | 		nr[lru] = 0; | 
					
						
							|  |  |  | 		nr[lru + LRU_ACTIVE] = 0; | 
					
						
							|  |  |  | 
 | 
					
						
							|  |  |  | 		/*
 | 
					
						
							|  |  |  | 		 * Recalculate the other LRU scan count based on its original | 
					
						
							|  |  |  | 		 * scan target and the percentage scanning already complete | 
					
						
							|  |  |  | 		 */ | 
					
						
							|  |  |  | 		lru = (lru == LRU_FILE) ? LRU_BASE : LRU_FILE; | 
					
						
							|  |  |  | 		nr_scanned = targets[lru] - nr[lru]; | 
					
						
							|  |  |  | 		nr[lru] = targets[lru] * (100 - percentage) / 100; | 
					
						
							|  |  |  | 		nr[lru] -= min(nr[lru], nr_scanned); | 
					
						
							|  |  |  | 
 | 
					
						
							|  |  |  | 		lru += LRU_ACTIVE; | 
					
						
							|  |  |  | 		nr_scanned = targets[lru] - nr[lru]; | 
					
						
							|  |  |  | 		nr[lru] = targets[lru] * (100 - percentage) / 100; | 
					
						
							|  |  |  | 		nr[lru] -= min(nr[lru], nr_scanned); | 
					
						
							|  |  |  | 
 | 
					
						
							|  |  |  | 		scan_adjusted = true; | 
					
						
							| 
									
										
										
										
											2013-02-22 16:32:19 -08:00
										 |  |  | 	} | 
					
						
							|  |  |  | 	blk_finish_plug(&plug); | 
					
						
							|  |  |  | 	sc->nr_reclaimed += nr_reclaimed; | 
					
						
							|  |  |  | 
 | 
					
						
							|  |  |  | 	/*
 | 
					
						
							|  |  |  | 	 * Even if we did not try to evict anon pages at all, we want to | 
					
						
							|  |  |  | 	 * rebalance the anon lru active/inactive ratio. | 
					
						
							|  |  |  | 	 */ | 
					
						
							|  |  |  | 	if (inactive_anon_is_low(lruvec)) | 
					
						
							|  |  |  | 		shrink_active_list(SWAP_CLUSTER_MAX, lruvec, | 
					
						
							|  |  |  | 				   sc, LRU_ACTIVE_ANON); | 
					
						
							|  |  |  | 
 | 
					
						
							|  |  |  | 	throttle_vm_writeout(sc->gfp_mask); | 
					
						
							|  |  |  | } | 
					
						
							|  |  |  | 
 | 
					
						
							| 
									
										
										
										
											2012-05-29 15:06:20 -07:00
										 |  |  | /* Use reclaim/compaction for costly allocs or under memory pressure */ | 
					
						
							| 
									
										
										
										
											2012-05-29 15:06:57 -07:00
										 |  |  | static bool in_reclaim_compaction(struct scan_control *sc) | 
					
						
							| 
									
										
										
										
											2012-05-29 15:06:20 -07:00
										 |  |  | { | 
					
						
							| 
									
										
										
										
											2012-12-11 16:00:31 -08:00
										 |  |  | 	if (IS_ENABLED(CONFIG_COMPACTION) && sc->order && | 
					
						
							| 
									
										
										
										
											2012-05-29 15:06:20 -07:00
										 |  |  | 			(sc->order > PAGE_ALLOC_COSTLY_ORDER || | 
					
						
							| 
									
										
										
										
											2012-05-29 15:06:57 -07:00
										 |  |  | 			 sc->priority < DEF_PRIORITY - 2)) | 
					
						
							| 
									
										
										
										
											2012-05-29 15:06:20 -07:00
										 |  |  | 		return true; | 
					
						
							|  |  |  | 
 | 
					
						
							|  |  |  | 	return false; | 
					
						
							|  |  |  | } | 
					
						
							|  |  |  | 
 | 
					
						
							| 
									
										
										
										
											2011-01-13 15:45:56 -08:00
										 |  |  | /*
 | 
					
						
							| 
									
										
										
										
											2012-05-29 15:06:20 -07:00
										 |  |  |  * Reclaim/compaction is used for high-order allocation requests. It reclaims | 
					
						
							|  |  |  |  * order-0 pages before compacting the zone. should_continue_reclaim() returns | 
					
						
							|  |  |  |  * true if more pages should be reclaimed such that when the page allocator | 
					
						
							|  |  |  |  * calls try_to_compact_zone() that it will have enough free pages to succeed. | 
					
						
							|  |  |  |  * It will give up earlier than that if there is difficulty reclaiming pages. | 
					
						
							| 
									
										
										
										
											2011-01-13 15:45:56 -08:00
										 |  |  |  */ | 
					
						
							| 
									
										
										
										
											2013-02-22 16:32:19 -08:00
										 |  |  | static inline bool should_continue_reclaim(struct zone *zone, | 
					
						
							| 
									
										
										
										
											2011-01-13 15:45:56 -08:00
										 |  |  | 					unsigned long nr_reclaimed, | 
					
						
							|  |  |  | 					unsigned long nr_scanned, | 
					
						
							|  |  |  | 					struct scan_control *sc) | 
					
						
							|  |  |  | { | 
					
						
							|  |  |  | 	unsigned long pages_for_compaction; | 
					
						
							|  |  |  | 	unsigned long inactive_lru_pages; | 
					
						
							|  |  |  | 
 | 
					
						
							|  |  |  | 	/* If not in reclaim/compaction mode, stop */ | 
					
						
							| 
									
										
										
										
											2012-05-29 15:06:57 -07:00
										 |  |  | 	if (!in_reclaim_compaction(sc)) | 
					
						
							| 
									
										
										
										
											2011-01-13 15:45:56 -08:00
										 |  |  | 		return false; | 
					
						
							|  |  |  | 
 | 
					
						
							| 
									
										
											  
											
												mm: vmscan: stop reclaim/compaction earlier due to insufficient progress if !__GFP_REPEAT
should_continue_reclaim() for reclaim/compaction allows scanning to
continue even if pages are not being reclaimed until the full list is
scanned.  In terms of allocation success, this makes sense but potentially
it introduces unwanted latency for high-order allocations such as
transparent hugepages and network jumbo frames that would prefer to fail
the allocation attempt and fallback to order-0 pages.  Worse, there is a
potential that the full LRU scan will clear all the young bits, distort
page aging information and potentially push pages into swap that would
have otherwise remained resident.
This patch will stop reclaim/compaction if no pages were reclaimed in the
last SWAP_CLUSTER_MAX pages that were considered.  For allocations such as
hugetlbfs that use __GFP_REPEAT and have fewer fallback options, the full
LRU list may still be scanned.
Order-0 allocation should not be affected because RECLAIM_MODE_COMPACTION
is not set so the following avoids the gfp_mask being examined:
        if (!(sc->reclaim_mode & RECLAIM_MODE_COMPACTION))
                return false;
A tool was developed based on ftrace that tracked the latency of
high-order allocations while transparent hugepage support was enabled and
three benchmarks were run.  The "fix-infinite" figures are 2.6.38-rc4 with
Johannes's patch "vmscan: fix zone shrinking exit when scan work is done"
applied.
  STREAM Highorder Allocation Latency Statistics
                 fix-infinite     break-early
  1 :: Count            10298           10229
  1 :: Min             0.4560          0.4640
  1 :: Mean            1.0589          1.0183
  1 :: Max            14.5990         11.7510
  1 :: Stddev          0.5208          0.4719
  2 :: Count                2               1
  2 :: Min             1.8610          3.7240
  2 :: Mean            3.4325          3.7240
  2 :: Max             5.0040          3.7240
  2 :: Stddev          1.5715          0.0000
  9 :: Count           111696          111694
  9 :: Min             0.5230          0.4110
  9 :: Mean           10.5831         10.5718
  9 :: Max            38.4480         43.2900
  9 :: Stddev          1.1147          1.1325
Mean time for order-1 allocations is reduced.  order-2 looks increased but
with so few allocations, it's not particularly significant.  THP mean
allocation latency is also reduced.  That said, allocation time varies so
significantly that the reductions are within noise.
Max allocation time is reduced by a significant amount for low-order
allocations but reduced for THP allocations which presumably are now
breaking before reclaim has done enough work.
  SysBench Highorder Allocation Latency Statistics
                 fix-infinite     break-early
  1 :: Count            15745           15677
  1 :: Min             0.4250          0.4550
  1 :: Mean            1.1023          1.0810
  1 :: Max            14.4590         10.8220
  1 :: Stddev          0.5117          0.5100
  2 :: Count                1               1
  2 :: Min             3.0040          2.1530
  2 :: Mean            3.0040          2.1530
  2 :: Max             3.0040          2.1530
  2 :: Stddev          0.0000          0.0000
  9 :: Count             2017            1931
  9 :: Min             0.4980          0.7480
  9 :: Mean           10.4717         10.3840
  9 :: Max            24.9460         26.2500
  9 :: Stddev          1.1726          1.1966
Again, mean time for order-1 allocations is reduced while order-2
allocations are too few to draw conclusions from.  The mean time for THP
allocations is also slightly reduced albeit the reductions are within
varianes.
Once again, our maximum allocation time is significantly reduced for
low-order allocations and slightly increased for THP allocations.
  Anon stream mmap reference Highorder Allocation Latency Statistics
  1 :: Count             1376            1790
  1 :: Min             0.4940          0.5010
  1 :: Mean            1.0289          0.9732
  1 :: Max             6.2670          4.2540
  1 :: Stddev          0.4142          0.2785
  2 :: Count                1               -
  2 :: Min             1.9060               -
  2 :: Mean            1.9060               -
  2 :: Max             1.9060               -
  2 :: Stddev          0.0000               -
  9 :: Count            11266           11257
  9 :: Min             0.4990          0.4940
  9 :: Mean        27250.4669      24256.1919
  9 :: Max      11439211.0000    6008885.0000
  9 :: Stddev     226427.4624     186298.1430
This benchmark creates one thread per CPU which references an amount of
anonymous memory 1.5 times the size of physical RAM.  This pounds swap
quite heavily and is intended to exercise THP a bit.
Mean allocation time for order-1 is reduced as before.  It's also reduced
for THP allocations but the variations here are pretty massive due to
swap.  As before, maximum allocation times are significantly reduced.
Overall, the patch reduces the mean and maximum allocation latencies for
the smaller high-order allocations.  This was with Slab configured so it
would be expected to be more significant with Slub which uses these size
allocations more aggressively.
The mean allocation times for THP allocations are also slightly reduced.
The maximum latency was slightly increased as predicted by the comments
due to reclaim/compaction breaking early.  However, workloads care more
about the latency of lower-order allocations than THP so it's an
acceptable trade-off.
Signed-off-by: Mel Gorman <mel@csn.ul.ie>
Acked-by: Andrea Arcangeli <aarcange@redhat.com>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Reviewed-by: Minchan Kim <minchan.kim@gmail.com>
Acked-by: Andrea Arcangeli <aarcange@redhat.com>
Acked-by: Rik van Riel <riel@redhat.com>
Cc: Michal Hocko <mhocko@suse.cz>
Cc: Kent Overstreet <kent.overstreet@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
											
										 
											2011-02-25 14:44:20 -08:00
										 |  |  | 	/* Consider stopping depending on scan and reclaim activity */ | 
					
						
							|  |  |  | 	if (sc->gfp_mask & __GFP_REPEAT) { | 
					
						
							|  |  |  | 		/*
 | 
					
						
							|  |  |  | 		 * For __GFP_REPEAT allocations, stop reclaiming if the | 
					
						
							|  |  |  | 		 * full LRU list has been scanned and we are still failing | 
					
						
							|  |  |  | 		 * to reclaim pages. This full LRU scan is potentially | 
					
						
							|  |  |  | 		 * expensive but a __GFP_REPEAT caller really wants to succeed | 
					
						
							|  |  |  | 		 */ | 
					
						
							|  |  |  | 		if (!nr_reclaimed && !nr_scanned) | 
					
						
							|  |  |  | 			return false; | 
					
						
							|  |  |  | 	} else { | 
					
						
							|  |  |  | 		/*
 | 
					
						
							|  |  |  | 		 * For non-__GFP_REPEAT allocations which can presumably | 
					
						
							|  |  |  | 		 * fail without consequence, stop if we failed to reclaim | 
					
						
							|  |  |  | 		 * any pages from the last SWAP_CLUSTER_MAX number of | 
					
						
							|  |  |  | 		 * pages that were scanned. This will return to the | 
					
						
							|  |  |  | 		 * caller faster at the risk reclaim/compaction and | 
					
						
							|  |  |  | 		 * the resulting allocation attempt fails | 
					
						
							|  |  |  | 		 */ | 
					
						
							|  |  |  | 		if (!nr_reclaimed) | 
					
						
							|  |  |  | 			return false; | 
					
						
							|  |  |  | 	} | 
					
						
							| 
									
										
										
										
											2011-01-13 15:45:56 -08:00
										 |  |  | 
 | 
					
						
							|  |  |  | 	/*
 | 
					
						
							|  |  |  | 	 * If we have not reclaimed enough pages for compaction and the | 
					
						
							|  |  |  | 	 * inactive lists are large enough, continue reclaiming | 
					
						
							|  |  |  | 	 */ | 
					
						
							|  |  |  | 	pages_for_compaction = (2UL << sc->order); | 
					
						
							| 
									
										
										
										
											2013-02-22 16:32:19 -08:00
										 |  |  | 	inactive_lru_pages = zone_page_state(zone, NR_INACTIVE_FILE); | 
					
						
							| 
									
										
											  
											
												swap: add per-partition lock for swapfile
swap_lock is heavily contended when I test swap to 3 fast SSD (even
slightly slower than swap to 2 such SSD).  The main contention comes
from swap_info_get().  This patch tries to fix the gap with adding a new
per-partition lock.
Global data like nr_swapfiles, total_swap_pages, least_priority and
swap_list are still protected by swap_lock.
nr_swap_pages is an atomic now, it can be changed without swap_lock.  In
theory, it's possible get_swap_page() finds no swap pages but actually
there are free swap pages.  But sounds not a big problem.
Accessing partition specific data (like scan_swap_map and so on) is only
protected by swap_info_struct.lock.
Changing swap_info_struct.flags need hold swap_lock and
swap_info_struct.lock, because scan_scan_map() will check it.  read the
flags is ok with either the locks hold.
If both swap_lock and swap_info_struct.lock must be hold, we always hold
the former first to avoid deadlock.
swap_entry_free() can change swap_list.  To delete that code, we add a
new highest_priority_index.  Whenever get_swap_page() is called, we
check it.  If it's valid, we use it.
It's a pity get_swap_page() still holds swap_lock().  But in practice,
swap_lock() isn't heavily contended in my test with this patch (or I can
say there are other much more heavier bottlenecks like TLB flush).  And
BTW, looks get_swap_page() doesn't really need the lock.  We never free
swap_info[] and we check SWAP_WRITEOK flag.  The only risk without the
lock is we could swapout to some low priority swap, but we can quickly
recover after several rounds of swap, so sounds not a big deal to me.
But I'd prefer to fix this if it's a real problem.
"swap: make each swap partition have one address_space" improved the
swapout speed from 1.7G/s to 2G/s.  This patch further improves the
speed to 2.3G/s, so around 15% improvement.  It's a multi-process test,
so TLB flush isn't the biggest bottleneck before the patches.
[arnd@arndb.de: fix it for nommu]
[hughd@google.com: add missing unlock]
[minchan@kernel.org: get rid of lockdep whinge on sys_swapon]
Signed-off-by: Shaohua Li <shli@fusionio.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: Rik van Riel <riel@redhat.com>
Cc: Minchan Kim <minchan.kim@gmail.com>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: Seth Jennings <sjenning@linux.vnet.ibm.com>
Cc: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Cc: Xiao Guangrong <xiaoguangrong@linux.vnet.ibm.com>
Cc: Dan Magenheimer <dan.magenheimer@oracle.com>
Cc: Stephen Rothwell <sfr@canb.auug.org.au>
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Signed-off-by: Hugh Dickins <hughd@google.com>
Signed-off-by: Minchan Kim <minchan@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
											
										 
											2013-02-22 16:34:38 -08:00
										 |  |  | 	if (get_nr_swap_pages() > 0) | 
					
						
							| 
									
										
										
										
											2013-02-22 16:32:19 -08:00
										 |  |  | 		inactive_lru_pages += zone_page_state(zone, NR_INACTIVE_ANON); | 
					
						
							| 
									
										
										
										
											2011-01-13 15:45:56 -08:00
										 |  |  | 	if (sc->nr_reclaimed < pages_for_compaction && | 
					
						
							|  |  |  | 			inactive_lru_pages > pages_for_compaction) | 
					
						
							|  |  |  | 		return true; | 
					
						
							|  |  |  | 
 | 
					
						
							|  |  |  | 	/* If compaction would go ahead or the allocation would succeed, stop */ | 
					
						
							| 
									
										
										
										
											2013-02-22 16:32:19 -08:00
										 |  |  | 	switch (compaction_suitable(zone, sc->order)) { | 
					
						
							| 
									
										
										
										
											2011-01-13 15:45:56 -08:00
										 |  |  | 	case COMPACT_PARTIAL: | 
					
						
							|  |  |  | 	case COMPACT_CONTINUE: | 
					
						
							|  |  |  | 		return false; | 
					
						
							|  |  |  | 	default: | 
					
						
							|  |  |  | 		return true; | 
					
						
							|  |  |  | 	} | 
					
						
							|  |  |  | } | 
					
						
							|  |  |  | 
 | 
					
						
							| 
									
										
										
										
											2013-09-24 15:27:41 -07:00
										 |  |  | static void shrink_zone(struct zone *zone, struct scan_control *sc) | 
					
						
							| 
									
										
										
										
											2005-04-16 15:20:36 -07:00
										 |  |  | { | 
					
						
							| 
									
										
										
										
											2011-02-10 15:01:34 -08:00
										 |  |  | 	unsigned long nr_reclaimed, nr_scanned; | 
					
						
							| 
									
										
										
										
											2005-04-16 15:20:36 -07:00
										 |  |  | 
 | 
					
						
							| 
									
										
										
										
											2013-02-22 16:32:19 -08:00
										 |  |  | 	do { | 
					
						
							|  |  |  | 		struct mem_cgroup *root = sc->target_mem_cgroup; | 
					
						
							|  |  |  | 		struct mem_cgroup_reclaim_cookie reclaim = { | 
					
						
							|  |  |  | 			.zone = zone, | 
					
						
							|  |  |  | 			.priority = sc->priority, | 
					
						
							|  |  |  | 		}; | 
					
						
							| 
									
										
										
										
											2013-09-24 15:27:37 -07:00
										 |  |  | 		struct mem_cgroup *memcg; | 
					
						
							| 
									
										
										
										
											2011-01-13 15:45:56 -08:00
										 |  |  | 
 | 
					
						
							| 
									
										
										
										
											2013-02-22 16:32:19 -08:00
										 |  |  | 		nr_reclaimed = sc->nr_reclaimed; | 
					
						
							|  |  |  | 		nr_scanned = sc->nr_scanned; | 
					
						
							| 
									
										
										
										
											2005-04-16 15:20:36 -07:00
										 |  |  | 
 | 
					
						
							| 
									
										
										
										
											2013-09-24 15:27:37 -07:00
										 |  |  | 		memcg = mem_cgroup_iter(root, NULL, &reclaim); | 
					
						
							|  |  |  | 		do { | 
					
						
							| 
									
										
										
										
											2013-02-22 16:32:19 -08:00
										 |  |  | 			struct lruvec *lruvec; | 
					
						
							| 
									
										
										
										
											2012-01-12 17:17:59 -08:00
										 |  |  | 
 | 
					
						
							| 
									
										
										
										
											2013-02-22 16:32:19 -08:00
										 |  |  | 			lruvec = mem_cgroup_zone_lruvec(zone, memcg); | 
					
						
							| 
									
										
										
										
											2012-05-29 15:07:02 -07:00
										 |  |  | 
 | 
					
						
							| 
									
										
										
										
											2013-02-22 16:32:19 -08:00
										 |  |  | 			shrink_lruvec(lruvec, sc); | 
					
						
							| 
									
										
										
										
											2012-01-12 17:17:52 -08:00
										 |  |  | 
 | 
					
						
							| 
									
										
										
										
											2013-02-22 16:32:19 -08:00
										 |  |  | 			/*
 | 
					
						
							| 
									
										
										
										
											2013-02-22 16:32:30 -08:00
										 |  |  | 			 * Direct reclaim and kswapd have to scan all memory | 
					
						
							|  |  |  | 			 * cgroups to fulfill the overall scan target for the | 
					
						
							| 
									
										
										
										
											2013-02-22 16:32:19 -08:00
										 |  |  | 			 * zone. | 
					
						
							| 
									
										
										
										
											2013-02-22 16:32:30 -08:00
										 |  |  | 			 * | 
					
						
							|  |  |  | 			 * Limit reclaim, on the other hand, only cares about | 
					
						
							|  |  |  | 			 * nr_to_reclaim pages to be reclaimed and it will | 
					
						
							|  |  |  | 			 * retry with decreasing priority if one round over the | 
					
						
							|  |  |  | 			 * whole hierarchy is not sufficient. | 
					
						
							| 
									
										
										
										
											2013-02-22 16:32:19 -08:00
										 |  |  | 			 */ | 
					
						
							| 
									
										
										
										
											2013-02-22 16:32:30 -08:00
										 |  |  | 			if (!global_reclaim(sc) && | 
					
						
							|  |  |  | 					sc->nr_reclaimed >= sc->nr_to_reclaim) { | 
					
						
							| 
									
										
										
										
											2013-02-22 16:32:19 -08:00
										 |  |  | 				mem_cgroup_iter_break(root, memcg); | 
					
						
							|  |  |  | 				break; | 
					
						
							|  |  |  | 			} | 
					
						
							| 
									
										
										
										
											2013-09-24 15:27:37 -07:00
										 |  |  | 			memcg = mem_cgroup_iter(root, memcg, &reclaim); | 
					
						
							|  |  |  | 		} while (memcg); | 
					
						
							| 
									
										
											  
											
												memcg: add memory.pressure_level events
With this patch userland applications that want to maintain the
interactivity/memory allocation cost can use the pressure level
notifications.  The levels are defined like this:
The "low" level means that the system is reclaiming memory for new
allocations.  Monitoring this reclaiming activity might be useful for
maintaining cache level.  Upon notification, the program (typically
"Activity Manager") might analyze vmstat and act in advance (i.e.
prematurely shutdown unimportant services).
The "medium" level means that the system is experiencing medium memory
pressure, the system might be making swap, paging out active file
caches, etc.  Upon this event applications may decide to further analyze
vmstat/zoneinfo/memcg or internal memory usage statistics and free any
resources that can be easily reconstructed or re-read from a disk.
The "critical" level means that the system is actively thrashing, it is
about to out of memory (OOM) or even the in-kernel OOM killer is on its
way to trigger.  Applications should do whatever they can to help the
system.  It might be too late to consult with vmstat or any other
statistics, so it's advisable to take an immediate action.
The events are propagated upward until the event is handled, i.e.  the
events are not pass-through.  Here is what this means: for example you
have three cgroups: A->B->C.  Now you set up an event listener on
cgroups A, B and C, and suppose group C experiences some pressure.  In
this situation, only group C will receive the notification, i.e.  groups
A and B will not receive it.  This is done to avoid excessive
"broadcasting" of messages, which disturbs the system and which is
especially bad if we are low on memory or thrashing.  So, organize the
cgroups wisely, or propagate the events manually (or, ask us to
implement the pass-through events, explaining why would you need them.)
Performance wise, the memory pressure notifications feature itself is
lightweight and does not require much of bookkeeping, in contrast to the
rest of memcg features.  Unfortunately, as of current memcg
implementation, pages accounting is an inseparable part and cannot be
turned off.  The good news is that there are some efforts[1] to improve
the situation; plus, implementing the same, fully API-compatible[2]
interface for CONFIG_MEMCG=n case (e.g.  embedded) is also a viable
option, so it will not require any changes on the userland side.
[1] http://permalink.gmane.org/gmane.linux.kernel.cgroups/6291
[2] http://lkml.org/lkml/2013/2/21/454
[akpm@linux-foundation.org: coding-style fixes]
[akpm@linux-foundation.org: fix CONFIG_CGROPUPS=n warnings]
Signed-off-by: Anton Vorontsov <anton.vorontsov@linaro.org>
Acked-by: Kirill A. Shutemov <kirill@shutemov.name>
Acked-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Tejun Heo <tj@kernel.org>
Cc: David Rientjes <rientjes@google.com>
Cc: Pekka Enberg <penberg@kernel.org>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Glauber Costa <glommer@parallels.com>
Cc: Michal Hocko <mhocko@suse.cz>
Cc: Luiz Capitulino <lcapitulino@redhat.com>
Cc: Greg Thelen <gthelen@google.com>
Cc: Leonid Moiseichuk <leonid.moiseichuk@nokia.com>
Cc: KOSAKI Motohiro <kosaki.motohiro@gmail.com>
Cc: Minchan Kim <minchan@kernel.org>
Cc: Bartlomiej Zolnierkiewicz <b.zolnierkie@samsung.com>
Cc: John Stultz <john.stultz@linaro.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
											
										 
											2013-04-29 15:08:31 -07:00
										 |  |  | 
 | 
					
						
							|  |  |  | 		vmpressure(sc->gfp_mask, sc->target_mem_cgroup, | 
					
						
							|  |  |  | 			   sc->nr_scanned - nr_scanned, | 
					
						
							|  |  |  | 			   sc->nr_reclaimed - nr_reclaimed); | 
					
						
							|  |  |  | 
 | 
					
						
							| 
									
										
										
										
											2013-02-22 16:32:19 -08:00
										 |  |  | 	} while (should_continue_reclaim(zone, sc->nr_reclaimed - nr_reclaimed, | 
					
						
							|  |  |  | 					 sc->nr_scanned - nr_scanned, sc)); | 
					
						
							| 
									
										
										
										
											2012-01-12 17:17:52 -08:00
										 |  |  | } | 
					
						
							|  |  |  | 
 | 
					
						
							| 
									
										
										
										
											2012-01-12 17:19:45 -08:00
										 |  |  | /* Returns true if compaction should go ahead for a high-order request */ | 
					
						
							|  |  |  | static inline bool compaction_ready(struct zone *zone, struct scan_control *sc) | 
					
						
							|  |  |  | { | 
					
						
							|  |  |  | 	unsigned long balance_gap, watermark; | 
					
						
							|  |  |  | 	bool watermark_ok; | 
					
						
							|  |  |  | 
 | 
					
						
							|  |  |  | 	/* Do not consider compaction for orders reclaim is meant to satisfy */ | 
					
						
							|  |  |  | 	if (sc->order <= PAGE_ALLOC_COSTLY_ORDER) | 
					
						
							|  |  |  | 		return false; | 
					
						
							|  |  |  | 
 | 
					
						
							|  |  |  | 	/*
 | 
					
						
							|  |  |  | 	 * Compaction takes time to run and there are potentially other | 
					
						
							|  |  |  | 	 * callers using the pages just freed. Continue reclaiming until | 
					
						
							|  |  |  | 	 * there is a buffer of free pages available to give compaction | 
					
						
							|  |  |  | 	 * a reasonable chance of completing and allocating the page | 
					
						
							|  |  |  | 	 */ | 
					
						
							|  |  |  | 	balance_gap = min(low_wmark_pages(zone), | 
					
						
							| 
									
										
										
										
											2013-02-22 16:33:52 -08:00
										 |  |  | 		(zone->managed_pages + KSWAPD_ZONE_BALANCE_GAP_RATIO-1) / | 
					
						
							| 
									
										
										
										
											2012-01-12 17:19:45 -08:00
										 |  |  | 			KSWAPD_ZONE_BALANCE_GAP_RATIO); | 
					
						
							|  |  |  | 	watermark = high_wmark_pages(zone) + balance_gap + (2UL << sc->order); | 
					
						
							|  |  |  | 	watermark_ok = zone_watermark_ok_safe(zone, 0, watermark, 0, 0); | 
					
						
							|  |  |  | 
 | 
					
						
							|  |  |  | 	/*
 | 
					
						
							|  |  |  | 	 * If compaction is deferred, reclaim up to a point where | 
					
						
							|  |  |  | 	 * compaction will have a chance of success when re-enabled | 
					
						
							|  |  |  | 	 */ | 
					
						
							| 
									
										
										
										
											2012-03-21 16:33:52 -07:00
										 |  |  | 	if (compaction_deferred(zone, sc->order)) | 
					
						
							| 
									
										
										
										
											2012-01-12 17:19:45 -08:00
										 |  |  | 		return watermark_ok; | 
					
						
							|  |  |  | 
 | 
					
						
							|  |  |  | 	/* If compaction is not ready to start, keep reclaiming */ | 
					
						
							|  |  |  | 	if (!compaction_suitable(zone, sc->order)) | 
					
						
							|  |  |  | 		return false; | 
					
						
							|  |  |  | 
 | 
					
						
							|  |  |  | 	return watermark_ok; | 
					
						
							|  |  |  | } | 
					
						
							|  |  |  | 
 | 
					
						
							| 
									
										
										
										
											2005-04-16 15:20:36 -07:00
										 |  |  | /*
 | 
					
						
							|  |  |  |  * This is the direct reclaim path, for page-allocating processes.  We only | 
					
						
							|  |  |  |  * try to reclaim pages from zones which will satisfy the caller's allocation | 
					
						
							|  |  |  |  * request. | 
					
						
							|  |  |  |  * | 
					
						
							| 
									
										
										
										
											2009-06-16 15:32:12 -07:00
										 |  |  |  * We reclaim from a zone even if that zone is over high_wmark_pages(zone). | 
					
						
							|  |  |  |  * Because: | 
					
						
							| 
									
										
										
										
											2005-04-16 15:20:36 -07:00
										 |  |  |  * a) The caller may be trying to free *extra* pages to satisfy a higher-order | 
					
						
							|  |  |  |  *    allocation or | 
					
						
							| 
									
										
										
										
											2009-06-16 15:32:12 -07:00
										 |  |  |  * b) The target zone may be at high_wmark_pages(zone) but the lower zones | 
					
						
							|  |  |  |  *    must go *over* high_wmark_pages(zone) to satisfy the `incremental min' | 
					
						
							|  |  |  |  *    zone defense algorithm. | 
					
						
							| 
									
										
										
										
											2005-04-16 15:20:36 -07:00
										 |  |  |  * | 
					
						
							|  |  |  |  * If a zone is deemed to be full of pinned pages then just give it a light | 
					
						
							|  |  |  |  * scan then give up on it. | 
					
						
							| 
									
										
										
										
											2011-10-31 17:09:33 -07:00
										 |  |  |  * | 
					
						
							|  |  |  |  * This function returns true if a zone is being reclaimed for a costly | 
					
						
							| 
									
										
										
										
											2012-01-12 17:19:45 -08:00
										 |  |  |  * high-order allocation and compaction is ready to begin. This indicates to | 
					
						
							| 
									
										
										
										
											2012-01-12 17:19:49 -08:00
										 |  |  |  * the caller that it should consider retrying the allocation instead of | 
					
						
							|  |  |  |  * further reclaim. | 
					
						
							| 
									
										
										
										
											2005-04-16 15:20:36 -07:00
										 |  |  |  */ | 
					
						
							| 
									
										
										
										
											2012-05-29 15:06:57 -07:00
										 |  |  | static bool shrink_zones(struct zonelist *zonelist, struct scan_control *sc) | 
					
						
							| 
									
										
										
										
											2005-04-16 15:20:36 -07:00
										 |  |  | { | 
					
						
							| 
									
										
										
										
											2008-04-28 02:12:17 -07:00
										 |  |  | 	struct zoneref *z; | 
					
						
							| 
									
										
										
										
											2008-04-28 02:12:16 -07:00
										 |  |  | 	struct zone *zone; | 
					
						
							| 
									
										
										
										
											2013-09-24 15:27:41 -07:00
										 |  |  | 	unsigned long nr_soft_reclaimed; | 
					
						
							|  |  |  | 	unsigned long nr_soft_scanned; | 
					
						
							| 
									
										
										
										
											2012-01-12 17:19:49 -08:00
										 |  |  | 	bool aborted_reclaim = false; | 
					
						
							| 
									
										
										
										
											2008-02-07 00:14:37 -08:00
										 |  |  | 
 | 
					
						
							| 
									
										
										
										
											2012-03-21 16:34:00 -07:00
										 |  |  | 	/*
 | 
					
						
							|  |  |  | 	 * If the number of buffer_heads in the machine exceeds the maximum | 
					
						
							|  |  |  | 	 * allowed level, force direct reclaim to scan the highmem zone as | 
					
						
							|  |  |  | 	 * highmem pages could be pinning lowmem pages storing buffer_heads | 
					
						
							|  |  |  | 	 */ | 
					
						
							|  |  |  | 	if (buffer_heads_over_limit) | 
					
						
							|  |  |  | 		sc->gfp_mask |= __GFP_HIGHMEM; | 
					
						
							|  |  |  | 
 | 
					
						
							| 
									
										
										
										
											2010-08-09 17:19:29 -07:00
										 |  |  | 	for_each_zone_zonelist_nodemask(zone, z, zonelist, | 
					
						
							|  |  |  | 					gfp_zone(sc->gfp_mask), sc->nodemask) { | 
					
						
							| 
									
										
										
										
											2006-01-06 00:11:15 -08:00
										 |  |  | 		if (!populated_zone(zone)) | 
					
						
							| 
									
										
										
										
											2005-04-16 15:20:36 -07:00
										 |  |  | 			continue; | 
					
						
							| 
									
										
										
										
											2008-02-07 00:14:37 -08:00
										 |  |  | 		/*
 | 
					
						
							|  |  |  | 		 * Take care memory controller reclaiming has small influence | 
					
						
							|  |  |  | 		 * to global LRU. | 
					
						
							|  |  |  | 		 */ | 
					
						
							| 
									
										
										
										
											2012-01-12 17:17:50 -08:00
										 |  |  | 		if (global_reclaim(sc)) { | 
					
						
							| 
									
										
										
										
											2008-02-07 00:14:37 -08:00
										 |  |  | 			if (!cpuset_zone_allowed_hardwall(zone, GFP_KERNEL)) | 
					
						
							|  |  |  | 				continue; | 
					
						
							| 
									
										
										
										
											2013-09-11 14:22:36 -07:00
										 |  |  | 			if (sc->priority != DEF_PRIORITY && | 
					
						
							|  |  |  | 			    !zone_reclaimable(zone)) | 
					
						
							| 
									
										
										
										
											2008-02-07 00:14:37 -08:00
										 |  |  | 				continue;	/* Let kswapd poll it */ | 
					
						
							| 
									
										
										
										
											2012-12-11 16:00:31 -08:00
										 |  |  | 			if (IS_ENABLED(CONFIG_COMPACTION)) { | 
					
						
							| 
									
										
										
										
											2011-10-31 17:09:31 -07:00
										 |  |  | 				/*
 | 
					
						
							| 
									
										
										
										
											2011-10-31 17:09:33 -07:00
										 |  |  | 				 * If we already have plenty of memory free for | 
					
						
							|  |  |  | 				 * compaction in this zone, don't free any more. | 
					
						
							|  |  |  | 				 * Even though compaction is invoked for any | 
					
						
							|  |  |  | 				 * non-zero order, only frequent costly order | 
					
						
							|  |  |  | 				 * reclamation is disruptive enough to become a | 
					
						
							| 
									
										
										
										
											2012-03-21 16:34:10 -07:00
										 |  |  | 				 * noticeable problem, like transparent huge | 
					
						
							|  |  |  | 				 * page allocations. | 
					
						
							| 
									
										
										
										
											2011-10-31 17:09:31 -07:00
										 |  |  | 				 */ | 
					
						
							| 
									
										
										
										
											2012-01-12 17:19:45 -08:00
										 |  |  | 				if (compaction_ready(zone, sc)) { | 
					
						
							| 
									
										
										
										
											2012-01-12 17:19:49 -08:00
										 |  |  | 					aborted_reclaim = true; | 
					
						
							| 
									
										
										
										
											2011-10-31 17:09:31 -07:00
										 |  |  | 					continue; | 
					
						
							| 
									
										
										
										
											2011-10-31 17:09:33 -07:00
										 |  |  | 				} | 
					
						
							| 
									
										
										
										
											2011-10-31 17:09:31 -07:00
										 |  |  | 			} | 
					
						
							| 
									
										
										
										
											2013-09-24 15:27:41 -07:00
										 |  |  | 			/*
 | 
					
						
							|  |  |  | 			 * This steals pages from memory cgroups over softlimit | 
					
						
							|  |  |  | 			 * and returns the number of reclaimed pages and | 
					
						
							|  |  |  | 			 * scanned pages. This works for global memory pressure | 
					
						
							|  |  |  | 			 * and balancing, not for a memcg's limit. | 
					
						
							|  |  |  | 			 */ | 
					
						
							|  |  |  | 			nr_soft_scanned = 0; | 
					
						
							|  |  |  | 			nr_soft_reclaimed = mem_cgroup_soft_limit_reclaim(zone, | 
					
						
							|  |  |  | 						sc->order, sc->gfp_mask, | 
					
						
							|  |  |  | 						&nr_soft_scanned); | 
					
						
							|  |  |  | 			sc->nr_reclaimed += nr_soft_reclaimed; | 
					
						
							|  |  |  | 			sc->nr_scanned += nr_soft_scanned; | 
					
						
							| 
									
										
										
										
											2011-06-27 16:18:12 -07:00
										 |  |  | 			/* need some check for avoid more shrink_zone() */ | 
					
						
							| 
									
										
										
										
											2008-02-07 00:14:37 -08:00
										 |  |  | 		} | 
					
						
							| 
									
										
										
										
											2006-09-25 23:31:27 -07:00
										 |  |  | 
 | 
					
						
							| 
									
										
										
										
											2012-05-29 15:06:57 -07:00
										 |  |  | 		shrink_zone(zone, sc); | 
					
						
							| 
									
										
										
										
											2005-04-16 15:20:36 -07:00
										 |  |  | 	} | 
					
						
							| 
									
										
										
										
											2011-10-31 17:09:33 -07:00
										 |  |  | 
 | 
					
						
							| 
									
										
										
										
											2012-01-12 17:19:49 -08:00
										 |  |  | 	return aborted_reclaim; | 
					
						
							| 
									
										
										
										
											2010-09-22 13:05:01 -07:00
										 |  |  | } | 
					
						
							|  |  |  | 
 | 
					
						
							| 
									
										
											  
											
												vmscan: all_unreclaimable() use zone->all_unreclaimable as a name
all_unreclaimable check in direct reclaim has been introduced at 2.6.19
by following commit.
	2006 Sep 25; commit 408d8544; oom: use unreclaimable info
And it went through strange history. firstly, following commit broke
the logic unintentionally.
	2008 Apr 29; commit a41f24ea; page allocator: smarter retry of
				      costly-order allocations
Two years later, I've found obvious meaningless code fragment and
restored original intention by following commit.
	2010 Jun 04; commit bb21c7ce; vmscan: fix do_try_to_free_pages()
				      return value when priority==0
But, the logic didn't works when 32bit highmem system goes hibernation
and Minchan slightly changed the algorithm and fixed it .
	2010 Sep 22: commit d1908362: vmscan: check all_unreclaimable
				      in direct reclaim path
But, recently, Andrey Vagin found the new corner case. Look,
	struct zone {
	  ..
	        int                     all_unreclaimable;
	  ..
	        unsigned long           pages_scanned;
	  ..
	}
zone->all_unreclaimable and zone->pages_scanned are neigher atomic
variables nor protected by lock.  Therefore zones can become a state of
zone->page_scanned=0 and zone->all_unreclaimable=1.  In this case, current
all_unreclaimable() return false even though zone->all_unreclaimabe=1.
This resulted in the kernel hanging up when executing a loop of the form
1. fork
2. mmap
3. touch memory
4. read memory
5. munmmap
as described in
http://www.gossamer-threads.com/lists/linux/kernel/1348725#1348725
Is this ignorable minor issue?  No.  Unfortunately, x86 has very small dma
zone and it become zone->all_unreclamble=1 easily.  and if it become
all_unreclaimable=1, it never restore all_unreclaimable=0.  Why?  if
all_unreclaimable=1, vmscan only try DEF_PRIORITY reclaim and
a-few-lru-pages>>DEF_PRIORITY always makes 0.  that mean no page scan at
all!
Eventually, oom-killer never works on such systems.  That said, we can't
use zone->pages_scanned for this purpose.  This patch restore
all_unreclaimable() use zone->all_unreclaimable as old.  and in addition,
to add oom_killer_disabled check to avoid reintroduce the issue of commit
d1908362 ("vmscan: check all_unreclaimable in direct reclaim path").
Reported-by: Andrey Vagin <avagin@openvz.org>
Signed-off-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Cc: Nick Piggin <npiggin@kernel.dk>
Reviewed-by: Minchan Kim <minchan.kim@gmail.com>
Reviewed-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Acked-by: David Rientjes <rientjes@google.com>
Cc: <stable@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
											
										 
											2011-04-14 15:22:12 -07:00
										 |  |  | /* All zones in zonelist are unreclaimable? */ | 
					
						
							| 
									
										
										
										
											2010-09-22 13:05:01 -07:00
										 |  |  | static bool all_unreclaimable(struct zonelist *zonelist, | 
					
						
							|  |  |  | 		struct scan_control *sc) | 
					
						
							|  |  |  | { | 
					
						
							|  |  |  | 	struct zoneref *z; | 
					
						
							|  |  |  | 	struct zone *zone; | 
					
						
							|  |  |  | 
 | 
					
						
							|  |  |  | 	for_each_zone_zonelist_nodemask(zone, z, zonelist, | 
					
						
							|  |  |  | 			gfp_zone(sc->gfp_mask), sc->nodemask) { | 
					
						
							|  |  |  | 		if (!populated_zone(zone)) | 
					
						
							|  |  |  | 			continue; | 
					
						
							|  |  |  | 		if (!cpuset_zone_allowed_hardwall(zone, GFP_KERNEL)) | 
					
						
							|  |  |  | 			continue; | 
					
						
							| 
									
										
										
										
											2013-09-11 14:22:36 -07:00
										 |  |  | 		if (zone_reclaimable(zone)) | 
					
						
							| 
									
										
											  
											
												vmscan: all_unreclaimable() use zone->all_unreclaimable as a name
all_unreclaimable check in direct reclaim has been introduced at 2.6.19
by following commit.
	2006 Sep 25; commit 408d8544; oom: use unreclaimable info
And it went through strange history. firstly, following commit broke
the logic unintentionally.
	2008 Apr 29; commit a41f24ea; page allocator: smarter retry of
				      costly-order allocations
Two years later, I've found obvious meaningless code fragment and
restored original intention by following commit.
	2010 Jun 04; commit bb21c7ce; vmscan: fix do_try_to_free_pages()
				      return value when priority==0
But, the logic didn't works when 32bit highmem system goes hibernation
and Minchan slightly changed the algorithm and fixed it .
	2010 Sep 22: commit d1908362: vmscan: check all_unreclaimable
				      in direct reclaim path
But, recently, Andrey Vagin found the new corner case. Look,
	struct zone {
	  ..
	        int                     all_unreclaimable;
	  ..
	        unsigned long           pages_scanned;
	  ..
	}
zone->all_unreclaimable and zone->pages_scanned are neigher atomic
variables nor protected by lock.  Therefore zones can become a state of
zone->page_scanned=0 and zone->all_unreclaimable=1.  In this case, current
all_unreclaimable() return false even though zone->all_unreclaimabe=1.
This resulted in the kernel hanging up when executing a loop of the form
1. fork
2. mmap
3. touch memory
4. read memory
5. munmmap
as described in
http://www.gossamer-threads.com/lists/linux/kernel/1348725#1348725
Is this ignorable minor issue?  No.  Unfortunately, x86 has very small dma
zone and it become zone->all_unreclamble=1 easily.  and if it become
all_unreclaimable=1, it never restore all_unreclaimable=0.  Why?  if
all_unreclaimable=1, vmscan only try DEF_PRIORITY reclaim and
a-few-lru-pages>>DEF_PRIORITY always makes 0.  that mean no page scan at
all!
Eventually, oom-killer never works on such systems.  That said, we can't
use zone->pages_scanned for this purpose.  This patch restore
all_unreclaimable() use zone->all_unreclaimable as old.  and in addition,
to add oom_killer_disabled check to avoid reintroduce the issue of commit
d1908362 ("vmscan: check all_unreclaimable in direct reclaim path").
Reported-by: Andrey Vagin <avagin@openvz.org>
Signed-off-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Cc: Nick Piggin <npiggin@kernel.dk>
Reviewed-by: Minchan Kim <minchan.kim@gmail.com>
Reviewed-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Acked-by: David Rientjes <rientjes@google.com>
Cc: <stable@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
											
										 
											2011-04-14 15:22:12 -07:00
										 |  |  | 			return false; | 
					
						
							| 
									
										
										
										
											2010-09-22 13:05:01 -07:00
										 |  |  | 	} | 
					
						
							|  |  |  | 
 | 
					
						
							| 
									
										
											  
											
												vmscan: all_unreclaimable() use zone->all_unreclaimable as a name
all_unreclaimable check in direct reclaim has been introduced at 2.6.19
by following commit.
	2006 Sep 25; commit 408d8544; oom: use unreclaimable info
And it went through strange history. firstly, following commit broke
the logic unintentionally.
	2008 Apr 29; commit a41f24ea; page allocator: smarter retry of
				      costly-order allocations
Two years later, I've found obvious meaningless code fragment and
restored original intention by following commit.
	2010 Jun 04; commit bb21c7ce; vmscan: fix do_try_to_free_pages()
				      return value when priority==0
But, the logic didn't works when 32bit highmem system goes hibernation
and Minchan slightly changed the algorithm and fixed it .
	2010 Sep 22: commit d1908362: vmscan: check all_unreclaimable
				      in direct reclaim path
But, recently, Andrey Vagin found the new corner case. Look,
	struct zone {
	  ..
	        int                     all_unreclaimable;
	  ..
	        unsigned long           pages_scanned;
	  ..
	}
zone->all_unreclaimable and zone->pages_scanned are neigher atomic
variables nor protected by lock.  Therefore zones can become a state of
zone->page_scanned=0 and zone->all_unreclaimable=1.  In this case, current
all_unreclaimable() return false even though zone->all_unreclaimabe=1.
This resulted in the kernel hanging up when executing a loop of the form
1. fork
2. mmap
3. touch memory
4. read memory
5. munmmap
as described in
http://www.gossamer-threads.com/lists/linux/kernel/1348725#1348725
Is this ignorable minor issue?  No.  Unfortunately, x86 has very small dma
zone and it become zone->all_unreclamble=1 easily.  and if it become
all_unreclaimable=1, it never restore all_unreclaimable=0.  Why?  if
all_unreclaimable=1, vmscan only try DEF_PRIORITY reclaim and
a-few-lru-pages>>DEF_PRIORITY always makes 0.  that mean no page scan at
all!
Eventually, oom-killer never works on such systems.  That said, we can't
use zone->pages_scanned for this purpose.  This patch restore
all_unreclaimable() use zone->all_unreclaimable as old.  and in addition,
to add oom_killer_disabled check to avoid reintroduce the issue of commit
d1908362 ("vmscan: check all_unreclaimable in direct reclaim path").
Reported-by: Andrey Vagin <avagin@openvz.org>
Signed-off-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Cc: Nick Piggin <npiggin@kernel.dk>
Reviewed-by: Minchan Kim <minchan.kim@gmail.com>
Reviewed-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Acked-by: David Rientjes <rientjes@google.com>
Cc: <stable@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
											
										 
											2011-04-14 15:22:12 -07:00
										 |  |  | 	return true; | 
					
						
							| 
									
										
										
										
											2005-04-16 15:20:36 -07:00
										 |  |  | } | 
					
						
							| 
									
										
										
										
											2008-10-18 20:26:32 -07:00
										 |  |  | 
 | 
					
						
							| 
									
										
										
										
											2005-04-16 15:20:36 -07:00
										 |  |  | /*
 | 
					
						
							|  |  |  |  * This is the main entry point to direct page reclaim. | 
					
						
							|  |  |  |  * | 
					
						
							|  |  |  |  * If a full scan of the inactive list fails to free enough memory then we | 
					
						
							|  |  |  |  * are "out of memory" and something needs to be killed. | 
					
						
							|  |  |  |  * | 
					
						
							|  |  |  |  * If the caller is !__GFP_FS then the probability of a failure is reasonably | 
					
						
							|  |  |  |  * high - the zone may be full of dirty or under-writeback pages, which this | 
					
						
							| 
									
										
										
										
											2009-09-23 19:37:09 +02:00
										 |  |  |  * caller can't do much about.  We kick the writeback threads and take explicit | 
					
						
							|  |  |  |  * naps in the hope that some of these pages can be written.  But if the | 
					
						
							|  |  |  |  * allocating task holds filesystem locks which prevent writeout this might not | 
					
						
							|  |  |  |  * work, and the allocation attempt will fail. | 
					
						
							| 
									
										
											  
											
												page allocator: smarter retry of costly-order allocations
Because of page order checks in __alloc_pages(), hugepage (and similarly
large order) allocations will not retry unless explicitly marked
__GFP_REPEAT. However, the current retry logic is nearly an infinite
loop (or until reclaim does no progress whatsoever). For these costly
allocations, that seems like overkill and could potentially never
terminate. Mel observed that allowing current __GFP_REPEAT semantics for
hugepage allocations essentially killed the system. I believe this is
because we may continue to reclaim small orders of pages all over, but
never have enough to satisfy the hugepage allocation request. This is
clearly only a problem for large order allocations, of which hugepages
are the most obvious (to me).
Modify try_to_free_pages() to indicate how many pages were reclaimed.
Use that information in __alloc_pages() to eventually fail a large
__GFP_REPEAT allocation when we've reclaimed an order of pages equal to
or greater than the allocation's order. This relies on lumpy reclaim
functioning as advertised. Due to fragmentation, lumpy reclaim may not
be able to free up the order needed in one invocation, so multiple
iterations may be requred. In other words, the more fragmented memory
is, the more retry attempts __GFP_REPEAT will make (particularly for
higher order allocations).
This changes the semantics of __GFP_REPEAT subtly, but *only* for
allocations > PAGE_ALLOC_COSTLY_ORDER. With this patch, for those size
allocations, we will try up to some point (at least 1<<order reclaimed
pages), rather than forever (which is the case for allocations <=
PAGE_ALLOC_COSTLY_ORDER).
This change improves the /proc/sys/vm/nr_hugepages interface with a
follow-on patch that makes pool allocations use __GFP_REPEAT. Rather
than administrators repeatedly echo'ing a particular value into the
sysctl, and forcing reclaim into action manually, this change allows for
the sysctl to attempt a reasonable effort itself. Similarly, dynamic
pool growth should be more successful under load, as lumpy reclaim can
try to free up pages, rather than failing right away.
Choosing to reclaim only up to the order of the requested allocation
strikes a balance between not failing hugepage allocations and returning
to the caller when it's unlikely to every succeed. Because of lumpy
reclaim, if we have freed the order requested, hopefully it has been in
big chunks and those chunks will allow our allocation to succeed. If
that isn't the case after freeing up the current order, I don't think it
is likely to succeed in the future, although it is possible given a
particular fragmentation pattern.
Signed-off-by: Nishanth Aravamudan <nacc@us.ibm.com>
Cc: Andy Whitcroft <apw@shadowen.org>
Tested-by: Mel Gorman <mel@csn.ul.ie>
Cc: Dave Hansen <haveblue@us.ibm.com>
Cc: Christoph Lameter <clameter@sgi.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
											
										 
											2008-04-29 00:58:25 -07:00
										 |  |  |  * | 
					
						
							|  |  |  |  * returns:	0, if no pages reclaimed | 
					
						
							|  |  |  |  * 		else, the number of pages reclaimed | 
					
						
							| 
									
										
										
										
											2005-04-16 15:20:36 -07:00
										 |  |  |  */ | 
					
						
							| 
									
										
										
										
											2008-04-28 02:12:12 -07:00
										 |  |  | static unsigned long do_try_to_free_pages(struct zonelist *zonelist, | 
					
						
							| 
									
										
										
										
											2011-05-24 17:12:26 -07:00
										 |  |  | 					struct scan_control *sc, | 
					
						
							|  |  |  | 					struct shrink_control *shrink) | 
					
						
							| 
									
										
										
										
											2005-04-16 15:20:36 -07:00
										 |  |  | { | 
					
						
							| 
									
										
										
										
											2006-03-22 00:08:19 -08:00
										 |  |  | 	unsigned long total_scanned = 0; | 
					
						
							| 
									
										
										
										
											2005-04-16 15:20:36 -07:00
										 |  |  | 	struct reclaim_state *reclaim_state = current->reclaim_state; | 
					
						
							| 
									
										
										
										
											2008-04-28 02:12:17 -07:00
										 |  |  | 	struct zoneref *z; | 
					
						
							| 
									
										
										
										
											2008-04-28 02:12:16 -07:00
										 |  |  | 	struct zone *zone; | 
					
						
							| 
									
										
										
										
											2009-12-14 17:59:10 -08:00
										 |  |  | 	unsigned long writeback_threshold; | 
					
						
							| 
									
										
										
										
											2012-01-12 17:19:49 -08:00
										 |  |  | 	bool aborted_reclaim; | 
					
						
							| 
									
										
										
										
											2005-04-16 15:20:36 -07:00
										 |  |  | 
 | 
					
						
							| 
									
										
										
										
											2008-07-25 01:48:52 -07:00
										 |  |  | 	delayacct_freepages_start(); | 
					
						
							|  |  |  | 
 | 
					
						
							| 
									
										
										
										
											2012-01-12 17:17:50 -08:00
										 |  |  | 	if (global_reclaim(sc)) | 
					
						
							| 
									
										
										
										
											2008-02-07 00:14:37 -08:00
										 |  |  | 		count_vm_event(ALLOCSTALL); | 
					
						
							| 
									
										
										
										
											2005-04-16 15:20:36 -07:00
										 |  |  | 
 | 
					
						
							| 
									
										
										
										
											2012-05-29 15:06:57 -07:00
										 |  |  | 	do { | 
					
						
							| 
									
										
											  
											
												memcg: add memory.pressure_level events
With this patch userland applications that want to maintain the
interactivity/memory allocation cost can use the pressure level
notifications.  The levels are defined like this:
The "low" level means that the system is reclaiming memory for new
allocations.  Monitoring this reclaiming activity might be useful for
maintaining cache level.  Upon notification, the program (typically
"Activity Manager") might analyze vmstat and act in advance (i.e.
prematurely shutdown unimportant services).
The "medium" level means that the system is experiencing medium memory
pressure, the system might be making swap, paging out active file
caches, etc.  Upon this event applications may decide to further analyze
vmstat/zoneinfo/memcg or internal memory usage statistics and free any
resources that can be easily reconstructed or re-read from a disk.
The "critical" level means that the system is actively thrashing, it is
about to out of memory (OOM) or even the in-kernel OOM killer is on its
way to trigger.  Applications should do whatever they can to help the
system.  It might be too late to consult with vmstat or any other
statistics, so it's advisable to take an immediate action.
The events are propagated upward until the event is handled, i.e.  the
events are not pass-through.  Here is what this means: for example you
have three cgroups: A->B->C.  Now you set up an event listener on
cgroups A, B and C, and suppose group C experiences some pressure.  In
this situation, only group C will receive the notification, i.e.  groups
A and B will not receive it.  This is done to avoid excessive
"broadcasting" of messages, which disturbs the system and which is
especially bad if we are low on memory or thrashing.  So, organize the
cgroups wisely, or propagate the events manually (or, ask us to
implement the pass-through events, explaining why would you need them.)
Performance wise, the memory pressure notifications feature itself is
lightweight and does not require much of bookkeeping, in contrast to the
rest of memcg features.  Unfortunately, as of current memcg
implementation, pages accounting is an inseparable part and cannot be
turned off.  The good news is that there are some efforts[1] to improve
the situation; plus, implementing the same, fully API-compatible[2]
interface for CONFIG_MEMCG=n case (e.g.  embedded) is also a viable
option, so it will not require any changes on the userland side.
[1] http://permalink.gmane.org/gmane.linux.kernel.cgroups/6291
[2] http://lkml.org/lkml/2013/2/21/454
[akpm@linux-foundation.org: coding-style fixes]
[akpm@linux-foundation.org: fix CONFIG_CGROPUPS=n warnings]
Signed-off-by: Anton Vorontsov <anton.vorontsov@linaro.org>
Acked-by: Kirill A. Shutemov <kirill@shutemov.name>
Acked-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Tejun Heo <tj@kernel.org>
Cc: David Rientjes <rientjes@google.com>
Cc: Pekka Enberg <penberg@kernel.org>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Glauber Costa <glommer@parallels.com>
Cc: Michal Hocko <mhocko@suse.cz>
Cc: Luiz Capitulino <lcapitulino@redhat.com>
Cc: Greg Thelen <gthelen@google.com>
Cc: Leonid Moiseichuk <leonid.moiseichuk@nokia.com>
Cc: KOSAKI Motohiro <kosaki.motohiro@gmail.com>
Cc: Minchan Kim <minchan@kernel.org>
Cc: Bartlomiej Zolnierkiewicz <b.zolnierkie@samsung.com>
Cc: John Stultz <john.stultz@linaro.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
											
										 
											2013-04-29 15:08:31 -07:00
										 |  |  | 		vmpressure_prio(sc->gfp_mask, sc->target_mem_cgroup, | 
					
						
							|  |  |  | 				sc->priority); | 
					
						
							| 
									
										
										
										
											2008-02-07 00:13:56 -08:00
										 |  |  | 		sc->nr_scanned = 0; | 
					
						
							| 
									
										
										
										
											2012-05-29 15:06:57 -07:00
										 |  |  | 		aborted_reclaim = shrink_zones(zonelist, sc); | 
					
						
							| 
									
										
										
										
											2011-10-31 17:09:33 -07:00
										 |  |  | 
 | 
					
						
							| 
									
										
										
										
											2008-02-07 00:13:56 -08:00
										 |  |  | 		/*
 | 
					
						
							| 
									
										
										
										
											2013-07-08 16:00:24 -07:00
										 |  |  | 		 * Don't shrink slabs when reclaiming memory from over limit | 
					
						
							|  |  |  | 		 * cgroups but do shrink slab at least once when aborting | 
					
						
							|  |  |  | 		 * reclaim for compaction to avoid unevenly scanning file/anon | 
					
						
							|  |  |  | 		 * LRU pages over slab pages. | 
					
						
							| 
									
										
										
										
											2008-02-07 00:13:56 -08:00
										 |  |  | 		 */ | 
					
						
							| 
									
										
										
										
											2012-01-12 17:17:50 -08:00
										 |  |  | 		if (global_reclaim(sc)) { | 
					
						
							| 
									
										
										
										
											2010-08-09 17:19:14 -07:00
										 |  |  | 			unsigned long lru_pages = 0; | 
					
						
							| 
									
										
										
										
											2013-08-28 10:18:03 +10:00
										 |  |  | 
 | 
					
						
							|  |  |  | 			nodes_clear(shrink->nodes_to_scan); | 
					
						
							| 
									
										
										
										
											2010-08-09 17:19:29 -07:00
										 |  |  | 			for_each_zone_zonelist(zone, z, zonelist, | 
					
						
							|  |  |  | 					gfp_zone(sc->gfp_mask)) { | 
					
						
							| 
									
										
										
										
											2010-08-09 17:19:14 -07:00
										 |  |  | 				if (!cpuset_zone_allowed_hardwall(zone, GFP_KERNEL)) | 
					
						
							|  |  |  | 					continue; | 
					
						
							|  |  |  | 
 | 
					
						
							|  |  |  | 				lru_pages += zone_reclaimable_pages(zone); | 
					
						
							| 
									
										
										
										
											2013-08-28 10:18:03 +10:00
										 |  |  | 				node_set(zone_to_nid(zone), | 
					
						
							|  |  |  | 					 shrink->nodes_to_scan); | 
					
						
							| 
									
										
										
										
											2010-08-09 17:19:14 -07:00
										 |  |  | 			} | 
					
						
							|  |  |  | 
 | 
					
						
							| 
									
										
										
										
											2011-05-24 17:12:27 -07:00
										 |  |  | 			shrink_slab(shrink, sc->nr_scanned, lru_pages); | 
					
						
							| 
									
										
										
										
											2008-02-07 00:14:29 -08:00
										 |  |  | 			if (reclaim_state) { | 
					
						
							| 
									
										
											  
											
												vmscan: bail out of direct reclaim after swap_cluster_max pages
When the VM is under pressure, it can happen that several direct reclaim
processes are in the pageout code simultaneously.  It also happens that
the reclaiming processes run into mostly referenced, mapped and dirty
pages in the first round.
This results in multiple direct reclaim processes having a lower
pageout priority, which corresponds to a higher target of pages to
scan.
This in turn can result in each direct reclaim process freeing
many pages.  Together, they can end up freeing way too many pages.
This kicks useful data out of memory (in some cases more than half
of all memory is swapped out).  It also impacts performance by
keeping tasks stuck in the pageout code for too long.
A 30% improvement in hackbench has been observed with this patch.
The fix is relatively simple: in shrink_zone() we can check how many
pages we have already freed, direct reclaim tasks break out of the
scanning loop if they have already freed enough pages and have reached
a lower priority level.
We do not break out of shrink_zone() when priority == DEF_PRIORITY,
to ensure that equal pressure is applied to every zone in the common
case.
However, in order to do this we do need to know how many pages we already
freed, so move nr_reclaimed into scan_control.
akpm: a historical interlude...
We tried this in 2004:
:commit e468e46a9bea3297011d5918663ce6d19094cf87
:Author: akpm <akpm>
:Date:   Thu Jun 24 15:53:52 2004 +0000
:
:[PATCH] vmscan.c: dont reclaim too many pages
:
:    The shrink_zone() logic can, under some circumstances, cause far too many
:    pages to be reclaimed.  Say, we're scanning at high priority and suddenly hit
:    a large number of reclaimable pages on the LRU.
:    Change things so we bale out when SWAP_CLUSTER_MAX pages have been reclaimed.
And we reverted it in 2006:
:commit 210fe530305ee50cd889fe9250168228b2994f32
:Author: Andrew Morton <akpm@osdl.org>
:Date:   Fri Jan 6 00:11:14 2006 -0800
:
:    [PATCH] vmscan: balancing fix
:
:    Revert a patch which went into 2.6.8-rc1.  The changelog for that patch was:
:
:      The shrink_zone() logic can, under some circumstances, cause far too many
:      pages to be reclaimed.  Say, we're scanning at high priority and suddenly
:      hit a large number of reclaimable pages on the LRU.
:
:      Change things so we bale out when SWAP_CLUSTER_MAX pages have been
:      reclaimed.
:
:    Problem is, this change caused significant imbalance in inter-zone scan
:    balancing by truncating scans of larger zones.
:
:    Suppose, for example, ZONE_HIGHMEM is 10x the size of ZONE_NORMAL.  The zone
:    balancing algorithm would require that if we're scanning 100 pages of
:    ZONE_HIGHMEM, we should scan 10 pages of ZONE_NORMAL.  But this logic will
:    cause the scanning of ZONE_HIGHMEM to bale out after only 32 pages are
:    reclaimed.  Thus effectively causing smaller zones to be scanned relatively
:    harder than large ones.
:
:    Now I need to remember what the workload was which caused me to write this
:    patch originally, then fix it up in a different way...
And we haven't demonstrated that whatever problem caused that reversion is
not being reintroduced by this change in 2008.
Signed-off-by: Rik van Riel <riel@redhat.com>
Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
											
										 
											2009-01-06 14:40:01 -08:00
										 |  |  | 				sc->nr_reclaimed += reclaim_state->reclaimed_slab; | 
					
						
							| 
									
										
										
										
											2008-02-07 00:14:29 -08:00
										 |  |  | 				reclaim_state->reclaimed_slab = 0; | 
					
						
							|  |  |  | 			} | 
					
						
							| 
									
										
										
										
											2005-04-16 15:20:36 -07:00
										 |  |  | 		} | 
					
						
							| 
									
										
										
										
											2008-02-07 00:13:56 -08:00
										 |  |  | 		total_scanned += sc->nr_scanned; | 
					
						
							| 
									
										
										
										
											2010-06-04 14:15:05 -07:00
										 |  |  | 		if (sc->nr_reclaimed >= sc->nr_to_reclaim) | 
					
						
							| 
									
										
										
										
											2005-04-16 15:20:36 -07:00
										 |  |  | 			goto out; | 
					
						
							|  |  |  | 
 | 
					
						
							| 
									
										
										
										
											2013-02-22 16:35:37 -08:00
										 |  |  | 		/*
 | 
					
						
							|  |  |  | 		 * If we're getting trouble reclaiming, start doing | 
					
						
							|  |  |  | 		 * writepage even in laptop mode. | 
					
						
							|  |  |  | 		 */ | 
					
						
							|  |  |  | 		if (sc->priority < DEF_PRIORITY - 2) | 
					
						
							|  |  |  | 			sc->may_writepage = 1; | 
					
						
							|  |  |  | 
 | 
					
						
							| 
									
										
										
										
											2005-04-16 15:20:36 -07:00
										 |  |  | 		/*
 | 
					
						
							|  |  |  | 		 * Try to write back as many pages as we just scanned.  This | 
					
						
							|  |  |  | 		 * tends to cause slow streaming writers to write data to the | 
					
						
							|  |  |  | 		 * disk smoothly, at the dirtying rate, which is nice.   But | 
					
						
							|  |  |  | 		 * that's undesirable in laptop mode, where we *want* lumpy | 
					
						
							|  |  |  | 		 * writeout.  So in laptop mode, write out the whole world. | 
					
						
							|  |  |  | 		 */ | 
					
						
							| 
									
										
										
										
											2009-12-14 17:59:10 -08:00
										 |  |  | 		writeback_threshold = sc->nr_to_reclaim + sc->nr_to_reclaim / 2; | 
					
						
							|  |  |  | 		if (total_scanned > writeback_threshold) { | 
					
						
							| 
									
										
										
										
											2011-10-07 21:54:10 -06:00
										 |  |  | 			wakeup_flusher_threads(laptop_mode ? 0 : total_scanned, | 
					
						
							|  |  |  | 						WB_REASON_TRY_TO_FREE_PAGES); | 
					
						
							| 
									
										
										
										
											2008-02-07 00:13:56 -08:00
										 |  |  | 			sc->may_writepage = 1; | 
					
						
							| 
									
										
										
										
											2005-04-16 15:20:36 -07:00
										 |  |  | 		} | 
					
						
							| 
									
										
										
										
											2013-07-08 16:00:24 -07:00
										 |  |  | 	} while (--sc->priority >= 0 && !aborted_reclaim); | 
					
						
							| 
									
										
										
										
											2010-06-04 14:15:05 -07:00
										 |  |  | 
 | 
					
						
							| 
									
										
										
										
											2005-04-16 15:20:36 -07:00
										 |  |  | out: | 
					
						
							| 
									
										
										
										
											2008-07-25 01:48:52 -07:00
										 |  |  | 	delayacct_freepages_end(); | 
					
						
							|  |  |  | 
 | 
					
						
							| 
									
										
										
										
											2010-06-04 14:15:05 -07:00
										 |  |  | 	if (sc->nr_reclaimed) | 
					
						
							|  |  |  | 		return sc->nr_reclaimed; | 
					
						
							|  |  |  | 
 | 
					
						
							| 
									
										
											  
											
												vmscan: all_unreclaimable() use zone->all_unreclaimable as a name
all_unreclaimable check in direct reclaim has been introduced at 2.6.19
by following commit.
	2006 Sep 25; commit 408d8544; oom: use unreclaimable info
And it went through strange history. firstly, following commit broke
the logic unintentionally.
	2008 Apr 29; commit a41f24ea; page allocator: smarter retry of
				      costly-order allocations
Two years later, I've found obvious meaningless code fragment and
restored original intention by following commit.
	2010 Jun 04; commit bb21c7ce; vmscan: fix do_try_to_free_pages()
				      return value when priority==0
But, the logic didn't works when 32bit highmem system goes hibernation
and Minchan slightly changed the algorithm and fixed it .
	2010 Sep 22: commit d1908362: vmscan: check all_unreclaimable
				      in direct reclaim path
But, recently, Andrey Vagin found the new corner case. Look,
	struct zone {
	  ..
	        int                     all_unreclaimable;
	  ..
	        unsigned long           pages_scanned;
	  ..
	}
zone->all_unreclaimable and zone->pages_scanned are neigher atomic
variables nor protected by lock.  Therefore zones can become a state of
zone->page_scanned=0 and zone->all_unreclaimable=1.  In this case, current
all_unreclaimable() return false even though zone->all_unreclaimabe=1.
This resulted in the kernel hanging up when executing a loop of the form
1. fork
2. mmap
3. touch memory
4. read memory
5. munmmap
as described in
http://www.gossamer-threads.com/lists/linux/kernel/1348725#1348725
Is this ignorable minor issue?  No.  Unfortunately, x86 has very small dma
zone and it become zone->all_unreclamble=1 easily.  and if it become
all_unreclaimable=1, it never restore all_unreclaimable=0.  Why?  if
all_unreclaimable=1, vmscan only try DEF_PRIORITY reclaim and
a-few-lru-pages>>DEF_PRIORITY always makes 0.  that mean no page scan at
all!
Eventually, oom-killer never works on such systems.  That said, we can't
use zone->pages_scanned for this purpose.  This patch restore
all_unreclaimable() use zone->all_unreclaimable as old.  and in addition,
to add oom_killer_disabled check to avoid reintroduce the issue of commit
d1908362 ("vmscan: check all_unreclaimable in direct reclaim path").
Reported-by: Andrey Vagin <avagin@openvz.org>
Signed-off-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Cc: Nick Piggin <npiggin@kernel.dk>
Reviewed-by: Minchan Kim <minchan.kim@gmail.com>
Reviewed-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Acked-by: David Rientjes <rientjes@google.com>
Cc: <stable@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
											
										 
											2011-04-14 15:22:12 -07:00
										 |  |  | 	/*
 | 
					
						
							|  |  |  | 	 * As hibernation is going on, kswapd is freezed so that it can't mark | 
					
						
							|  |  |  | 	 * the zone into all_unreclaimable. Thus bypassing all_unreclaimable | 
					
						
							|  |  |  | 	 * check. | 
					
						
							|  |  |  | 	 */ | 
					
						
							|  |  |  | 	if (oom_killer_disabled) | 
					
						
							|  |  |  | 		return 0; | 
					
						
							|  |  |  | 
 | 
					
						
							| 
									
										
										
										
											2012-01-12 17:19:49 -08:00
										 |  |  | 	/* Aborted reclaim to try compaction? don't OOM, then */ | 
					
						
							|  |  |  | 	if (aborted_reclaim) | 
					
						
							| 
									
										
										
										
											2012-01-12 17:19:33 -08:00
										 |  |  | 		return 1; | 
					
						
							|  |  |  | 
 | 
					
						
							| 
									
										
										
										
											2010-06-04 14:15:05 -07:00
										 |  |  | 	/* top priority shrink_zones still had more to do? don't OOM, then */ | 
					
						
							| 
									
										
										
										
											2012-01-12 17:17:50 -08:00
										 |  |  | 	if (global_reclaim(sc) && !all_unreclaimable(zonelist, sc)) | 
					
						
							| 
									
										
										
										
											2010-06-04 14:15:05 -07:00
										 |  |  | 		return 1; | 
					
						
							|  |  |  | 
 | 
					
						
							|  |  |  | 	return 0; | 
					
						
							| 
									
										
										
										
											2005-04-16 15:20:36 -07:00
										 |  |  | } | 
					
						
							|  |  |  | 
 | 
					
						
							| 
									
										
										
										
											2012-07-31 16:44:35 -07:00
										 |  |  | static bool pfmemalloc_watermark_ok(pg_data_t *pgdat) | 
					
						
							|  |  |  | { | 
					
						
							|  |  |  | 	struct zone *zone; | 
					
						
							|  |  |  | 	unsigned long pfmemalloc_reserve = 0; | 
					
						
							|  |  |  | 	unsigned long free_pages = 0; | 
					
						
							|  |  |  | 	int i; | 
					
						
							|  |  |  | 	bool wmark_ok; | 
					
						
							|  |  |  | 
 | 
					
						
							|  |  |  | 	for (i = 0; i <= ZONE_NORMAL; i++) { | 
					
						
							|  |  |  | 		zone = &pgdat->node_zones[i]; | 
					
						
							|  |  |  | 		pfmemalloc_reserve += min_wmark_pages(zone); | 
					
						
							|  |  |  | 		free_pages += zone_page_state(zone, NR_FREE_PAGES); | 
					
						
							|  |  |  | 	} | 
					
						
							|  |  |  | 
 | 
					
						
							|  |  |  | 	wmark_ok = free_pages > pfmemalloc_reserve / 2; | 
					
						
							|  |  |  | 
 | 
					
						
							|  |  |  | 	/* kswapd must be awake if processes are being throttled */ | 
					
						
							|  |  |  | 	if (!wmark_ok && waitqueue_active(&pgdat->kswapd_wait)) { | 
					
						
							|  |  |  | 		pgdat->classzone_idx = min(pgdat->classzone_idx, | 
					
						
							|  |  |  | 						(enum zone_type)ZONE_NORMAL); | 
					
						
							|  |  |  | 		wake_up_interruptible(&pgdat->kswapd_wait); | 
					
						
							|  |  |  | 	} | 
					
						
							|  |  |  | 
 | 
					
						
							|  |  |  | 	return wmark_ok; | 
					
						
							|  |  |  | } | 
					
						
							|  |  |  | 
 | 
					
						
							|  |  |  | /*
 | 
					
						
							|  |  |  |  * Throttle direct reclaimers if backing storage is backed by the network | 
					
						
							|  |  |  |  * and the PFMEMALLOC reserve for the preferred node is getting dangerously | 
					
						
							|  |  |  |  * depleted. kswapd will continue to make progress and wake the processes | 
					
						
							| 
									
										
										
										
											2012-11-26 16:29:48 -08:00
										 |  |  |  * when the low watermark is reached. | 
					
						
							|  |  |  |  * | 
					
						
							|  |  |  |  * Returns true if a fatal signal was delivered during throttling. If this | 
					
						
							|  |  |  |  * happens, the page allocator should not consider triggering the OOM killer. | 
					
						
							| 
									
										
										
										
											2012-07-31 16:44:35 -07:00
										 |  |  |  */ | 
					
						
							| 
									
										
										
										
											2012-11-26 16:29:48 -08:00
										 |  |  | static bool throttle_direct_reclaim(gfp_t gfp_mask, struct zonelist *zonelist, | 
					
						
							| 
									
										
										
										
											2012-07-31 16:44:35 -07:00
										 |  |  | 					nodemask_t *nodemask) | 
					
						
							|  |  |  | { | 
					
						
							|  |  |  | 	struct zone *zone; | 
					
						
							|  |  |  | 	int high_zoneidx = gfp_zone(gfp_mask); | 
					
						
							|  |  |  | 	pg_data_t *pgdat; | 
					
						
							|  |  |  | 
 | 
					
						
							|  |  |  | 	/*
 | 
					
						
							|  |  |  | 	 * Kernel threads should not be throttled as they may be indirectly | 
					
						
							|  |  |  | 	 * responsible for cleaning pages necessary for reclaim to make forward | 
					
						
							|  |  |  | 	 * progress. kjournald for example may enter direct reclaim while | 
					
						
							|  |  |  | 	 * committing a transaction where throttling it could forcing other | 
					
						
							|  |  |  | 	 * processes to block on log_wait_commit(). | 
					
						
							|  |  |  | 	 */ | 
					
						
							|  |  |  | 	if (current->flags & PF_KTHREAD) | 
					
						
							| 
									
										
										
										
											2012-11-26 16:29:48 -08:00
										 |  |  | 		goto out; | 
					
						
							|  |  |  | 
 | 
					
						
							|  |  |  | 	/*
 | 
					
						
							|  |  |  | 	 * If a fatal signal is pending, this process should not throttle. | 
					
						
							|  |  |  | 	 * It should return quickly so it can exit and free its memory | 
					
						
							|  |  |  | 	 */ | 
					
						
							|  |  |  | 	if (fatal_signal_pending(current)) | 
					
						
							|  |  |  | 		goto out; | 
					
						
							| 
									
										
										
										
											2012-07-31 16:44:35 -07:00
										 |  |  | 
 | 
					
						
							|  |  |  | 	/* Check if the pfmemalloc reserves are ok */ | 
					
						
							|  |  |  | 	first_zones_zonelist(zonelist, high_zoneidx, NULL, &zone); | 
					
						
							|  |  |  | 	pgdat = zone->zone_pgdat; | 
					
						
							|  |  |  | 	if (pfmemalloc_watermark_ok(pgdat)) | 
					
						
							| 
									
										
										
										
											2012-11-26 16:29:48 -08:00
										 |  |  | 		goto out; | 
					
						
							| 
									
										
										
										
											2012-07-31 16:44:35 -07:00
										 |  |  | 
 | 
					
						
							| 
									
										
										
										
											2012-07-31 16:44:39 -07:00
										 |  |  | 	/* Account for the throttling */ | 
					
						
							|  |  |  | 	count_vm_event(PGSCAN_DIRECT_THROTTLE); | 
					
						
							|  |  |  | 
 | 
					
						
							| 
									
										
										
										
											2012-07-31 16:44:35 -07:00
										 |  |  | 	/*
 | 
					
						
							|  |  |  | 	 * If the caller cannot enter the filesystem, it's possible that it | 
					
						
							|  |  |  | 	 * is due to the caller holding an FS lock or performing a journal | 
					
						
							|  |  |  | 	 * transaction in the case of a filesystem like ext[3|4]. In this case, | 
					
						
							|  |  |  | 	 * it is not safe to block on pfmemalloc_wait as kswapd could be | 
					
						
							|  |  |  | 	 * blocked waiting on the same lock. Instead, throttle for up to a | 
					
						
							|  |  |  | 	 * second before continuing. | 
					
						
							|  |  |  | 	 */ | 
					
						
							|  |  |  | 	if (!(gfp_mask & __GFP_FS)) { | 
					
						
							|  |  |  | 		wait_event_interruptible_timeout(pgdat->pfmemalloc_wait, | 
					
						
							|  |  |  | 			pfmemalloc_watermark_ok(pgdat), HZ); | 
					
						
							| 
									
										
										
										
											2012-11-26 16:29:48 -08:00
										 |  |  | 
 | 
					
						
							|  |  |  | 		goto check_pending; | 
					
						
							| 
									
										
										
										
											2012-07-31 16:44:35 -07:00
										 |  |  | 	} | 
					
						
							|  |  |  | 
 | 
					
						
							|  |  |  | 	/* Throttle until kswapd wakes the process */ | 
					
						
							|  |  |  | 	wait_event_killable(zone->zone_pgdat->pfmemalloc_wait, | 
					
						
							|  |  |  | 		pfmemalloc_watermark_ok(pgdat)); | 
					
						
							| 
									
										
										
										
											2012-11-26 16:29:48 -08:00
										 |  |  | 
 | 
					
						
							|  |  |  | check_pending: | 
					
						
							|  |  |  | 	if (fatal_signal_pending(current)) | 
					
						
							|  |  |  | 		return true; | 
					
						
							|  |  |  | 
 | 
					
						
							|  |  |  | out: | 
					
						
							|  |  |  | 	return false; | 
					
						
							| 
									
										
										
										
											2012-07-31 16:44:35 -07:00
										 |  |  | } | 
					
						
							|  |  |  | 
 | 
					
						
							| 
									
										
										
										
											2008-04-28 02:12:12 -07:00
										 |  |  | unsigned long try_to_free_pages(struct zonelist *zonelist, int order, | 
					
						
							| 
									
										
										
										
											2009-03-31 15:23:31 -07:00
										 |  |  | 				gfp_t gfp_mask, nodemask_t *nodemask) | 
					
						
							| 
									
										
										
										
											2008-02-07 00:13:56 -08:00
										 |  |  | { | 
					
						
							| 
									
										
										
										
											2010-08-09 17:19:16 -07:00
										 |  |  | 	unsigned long nr_reclaimed; | 
					
						
							| 
									
										
										
										
											2008-02-07 00:13:56 -08:00
										 |  |  | 	struct scan_control sc = { | 
					
						
							| 
									
										
											  
											
												mm: teach mm by current context info to not do I/O during memory allocation
This patch introduces PF_MEMALLOC_NOIO on process flag('flags' field of
'struct task_struct'), so that the flag can be set by one task to avoid
doing I/O inside memory allocation in the task's context.
The patch trys to solve one deadlock problem caused by block device, and
the problem may happen at least in the below situations:
- during block device runtime resume, if memory allocation with
  GFP_KERNEL is called inside runtime resume callback of any one of its
  ancestors(or the block device itself), the deadlock may be triggered
  inside the memory allocation since it might not complete until the block
  device becomes active and the involed page I/O finishes.  The situation
  is pointed out first by Alan Stern.  It is not a good approach to
  convert all GFP_KERNEL[1] in the path into GFP_NOIO because several
  subsystems may be involved(for example, PCI, USB and SCSI may be
  involved for usb mass stoarage device, network devices involved too in
  the iSCSI case)
- during block device runtime suspend, because runtime resume need to
  wait for completion of concurrent runtime suspend.
- during error handling of usb mass storage deivce, USB bus reset will
  be put on the device, so there shouldn't have any memory allocation with
  GFP_KERNEL during USB bus reset, otherwise the deadlock similar with
  above may be triggered.  Unfortunately, any usb device may include one
  mass storage interface in theory, so it requires all usb interface
  drivers to handle the situation.  In fact, most usb drivers don't know
  how to handle bus reset on the device and don't provide .pre_set() and
  .post_reset() callback at all, so USB core has to unbind and bind driver
  for these devices.  So it is still not practical to resort to GFP_NOIO
  for solving the problem.
Also the introduced solution can be used by block subsystem or block
drivers too, for example, set the PF_MEMALLOC_NOIO flag before doing
actual I/O transfer.
It is not a good idea to convert all these GFP_KERNEL in the affected
path into GFP_NOIO because these functions doing that may be implemented
as library and will be called in many other contexts.
In fact, memalloc_noio_flags() can convert some of current static
GFP_NOIO allocation into GFP_KERNEL back in other non-affected contexts,
at least almost all GFP_NOIO in USB subsystem can be converted into
GFP_KERNEL after applying the approach and make allocation with GFP_NOIO
only happen in runtime resume/bus reset/block I/O transfer contexts
generally.
[1], several GFP_KERNEL allocation examples in runtime resume path
- pci subsystem
acpi_os_allocate
	<-acpi_ut_allocate
		<-ACPI_ALLOCATE_ZEROED
			<-acpi_evaluate_object
				<-__acpi_bus_set_power
					<-acpi_bus_set_power
						<-acpi_pci_set_power_state
							<-platform_pci_set_power_state
								<-pci_platform_power_transition
									<-__pci_complete_power_transition
										<-pci_set_power_state
											<-pci_restore_standard_config
												<-pci_pm_runtime_resume
- usb subsystem
usb_get_status
	<-finish_port_resume
		<-usb_port_resume
			<-generic_resume
				<-usb_resume_device
					<-usb_resume_both
						<-usb_runtime_resume
- some individual usb drivers
usblp, uvc, gspca, most of dvb-usb-v2 media drivers, cpia2, az6007, ....
That is just what I have found.  Unfortunately, this allocation can only
be found by human being now, and there should be many not found since
any function in the resume path(call tree) may allocate memory with
GFP_KERNEL.
Signed-off-by: Ming Lei <ming.lei@canonical.com>
Signed-off-by: Minchan Kim <minchan@kernel.org>
Cc: Alan Stern <stern@rowland.harvard.edu>
Cc: Oliver Neukum <oneukum@suse.de>
Cc: Jiri Kosina <jiri.kosina@suse.com>
Cc: Mel Gorman <mel@csn.ul.ie>
Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Michal Hocko <mhocko@suse.cz>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: "Rafael J. Wysocki" <rjw@sisk.pl>
Cc: Greg KH <greg@kroah.com>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: "David S. Miller" <davem@davemloft.net>
Cc: Eric Dumazet <eric.dumazet@gmail.com>
Cc: David Decotigny <david.decotigny@google.com>
Cc: Tom Herbert <therbert@google.com>
Cc: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
											
										 
											2013-02-22 16:34:08 -08:00
										 |  |  | 		.gfp_mask = (gfp_mask = memalloc_noio_flags(gfp_mask)), | 
					
						
							| 
									
										
										
										
											2008-02-07 00:13:56 -08:00
										 |  |  | 		.may_writepage = !laptop_mode, | 
					
						
							| 
									
										
										
										
											2009-12-14 17:59:10 -08:00
										 |  |  | 		.nr_to_reclaim = SWAP_CLUSTER_MAX, | 
					
						
							| 
									
										
										
										
											2009-03-31 15:19:30 -07:00
										 |  |  | 		.may_unmap = 1, | 
					
						
							| 
									
										
										
										
											2009-04-21 12:24:57 -07:00
										 |  |  | 		.may_swap = 1, | 
					
						
							| 
									
										
										
										
											2008-02-07 00:13:56 -08:00
										 |  |  | 		.order = order, | 
					
						
							| 
									
										
										
										
											2012-05-29 15:06:57 -07:00
										 |  |  | 		.priority = DEF_PRIORITY, | 
					
						
							| 
									
										
										
										
											2012-01-12 17:17:52 -08:00
										 |  |  | 		.target_mem_cgroup = NULL, | 
					
						
							| 
									
										
										
										
											2009-03-31 15:23:31 -07:00
										 |  |  | 		.nodemask = nodemask, | 
					
						
							| 
									
										
										
										
											2008-02-07 00:13:56 -08:00
										 |  |  | 	}; | 
					
						
							| 
									
										
										
										
											2011-05-24 17:12:26 -07:00
										 |  |  | 	struct shrink_control shrink = { | 
					
						
							|  |  |  | 		.gfp_mask = sc.gfp_mask, | 
					
						
							|  |  |  | 	}; | 
					
						
							| 
									
										
										
										
											2008-02-07 00:13:56 -08:00
										 |  |  | 
 | 
					
						
							| 
									
										
										
										
											2012-07-31 16:44:35 -07:00
										 |  |  | 	/*
 | 
					
						
							| 
									
										
										
										
											2012-11-26 16:29:48 -08:00
										 |  |  | 	 * Do not enter reclaim if fatal signal was delivered while throttled. | 
					
						
							|  |  |  | 	 * 1 is returned so that the page allocator does not OOM kill at this | 
					
						
							|  |  |  | 	 * point. | 
					
						
							| 
									
										
										
										
											2012-07-31 16:44:35 -07:00
										 |  |  | 	 */ | 
					
						
							| 
									
										
										
										
											2012-11-26 16:29:48 -08:00
										 |  |  | 	if (throttle_direct_reclaim(gfp_mask, zonelist, nodemask)) | 
					
						
							| 
									
										
										
										
											2012-07-31 16:44:35 -07:00
										 |  |  | 		return 1; | 
					
						
							|  |  |  | 
 | 
					
						
							| 
									
										
										
										
											2010-08-09 17:19:16 -07:00
										 |  |  | 	trace_mm_vmscan_direct_reclaim_begin(order, | 
					
						
							|  |  |  | 				sc.may_writepage, | 
					
						
							|  |  |  | 				gfp_mask); | 
					
						
							|  |  |  | 
 | 
					
						
							| 
									
										
										
										
											2011-05-24 17:12:26 -07:00
										 |  |  | 	nr_reclaimed = do_try_to_free_pages(zonelist, &sc, &shrink); | 
					
						
							| 
									
										
										
										
											2010-08-09 17:19:16 -07:00
										 |  |  | 
 | 
					
						
							|  |  |  | 	trace_mm_vmscan_direct_reclaim_end(nr_reclaimed); | 
					
						
							|  |  |  | 
 | 
					
						
							|  |  |  | 	return nr_reclaimed; | 
					
						
							| 
									
										
										
										
											2008-02-07 00:13:56 -08:00
										 |  |  | } | 
					
						
							|  |  |  | 
 | 
					
						
							| 
									
										
										
										
											2012-07-31 16:43:02 -07:00
										 |  |  | #ifdef CONFIG_MEMCG
 | 
					
						
							| 
									
										
										
										
											2008-02-07 00:13:56 -08:00
										 |  |  | 
 | 
					
						
							| 
									
										
										
										
											2012-01-12 17:18:32 -08:00
										 |  |  | unsigned long mem_cgroup_shrink_node_zone(struct mem_cgroup *memcg, | 
					
						
							| 
									
										
										
										
											2009-09-23 15:56:39 -07:00
										 |  |  | 						gfp_t gfp_mask, bool noswap, | 
					
						
							| 
									
										
										
										
											2011-05-26 16:25:25 -07:00
										 |  |  | 						struct zone *zone, | 
					
						
							|  |  |  | 						unsigned long *nr_scanned) | 
					
						
							| 
									
										
										
										
											2009-09-23 15:56:39 -07:00
										 |  |  | { | 
					
						
							|  |  |  | 	struct scan_control sc = { | 
					
						
							| 
									
										
										
										
											2011-05-26 16:25:25 -07:00
										 |  |  | 		.nr_scanned = 0, | 
					
						
							| 
									
										
										
										
											2010-08-10 18:03:02 -07:00
										 |  |  | 		.nr_to_reclaim = SWAP_CLUSTER_MAX, | 
					
						
							| 
									
										
										
										
											2009-09-23 15:56:39 -07:00
										 |  |  | 		.may_writepage = !laptop_mode, | 
					
						
							|  |  |  | 		.may_unmap = 1, | 
					
						
							|  |  |  | 		.may_swap = !noswap, | 
					
						
							|  |  |  | 		.order = 0, | 
					
						
							| 
									
										
										
										
											2012-05-29 15:06:57 -07:00
										 |  |  | 		.priority = 0, | 
					
						
							| 
									
										
										
										
											2012-01-12 17:18:32 -08:00
										 |  |  | 		.target_mem_cgroup = memcg, | 
					
						
							| 
									
										
										
										
											2009-09-23 15:56:39 -07:00
										 |  |  | 	}; | 
					
						
							| 
									
										
										
										
											2012-05-29 15:07:02 -07:00
										 |  |  | 	struct lruvec *lruvec = mem_cgroup_zone_lruvec(zone, memcg); | 
					
						
							| 
									
										
										
										
											2011-05-26 16:25:25 -07:00
										 |  |  | 
 | 
					
						
							| 
									
										
										
										
											2009-09-23 15:56:39 -07:00
										 |  |  | 	sc.gfp_mask = (gfp_mask & GFP_RECLAIM_MASK) | | 
					
						
							|  |  |  | 			(GFP_HIGHUSER_MOVABLE & ~GFP_RECLAIM_MASK); | 
					
						
							| 
									
										
										
										
											2010-08-09 17:19:56 -07:00
										 |  |  | 
 | 
					
						
							| 
									
										
										
										
											2012-05-29 15:06:57 -07:00
										 |  |  | 	trace_mm_vmscan_memcg_softlimit_reclaim_begin(sc.order, | 
					
						
							| 
									
										
										
										
											2010-08-09 17:19:56 -07:00
										 |  |  | 						      sc.may_writepage, | 
					
						
							|  |  |  | 						      sc.gfp_mask); | 
					
						
							|  |  |  | 
 | 
					
						
							| 
									
										
										
										
											2009-09-23 15:56:39 -07:00
										 |  |  | 	/*
 | 
					
						
							|  |  |  | 	 * NOTE: Although we can get the priority field, using it | 
					
						
							|  |  |  | 	 * here is not a good idea, since it limits the pages we can scan. | 
					
						
							|  |  |  | 	 * if we don't reclaim here, the shrink_zone from balance_pgdat | 
					
						
							|  |  |  | 	 * will pick up pages from other mem cgroup's as well. We hack | 
					
						
							|  |  |  | 	 * the priority and make it zero. | 
					
						
							|  |  |  | 	 */ | 
					
						
							| 
									
										
										
										
											2012-05-29 15:07:02 -07:00
										 |  |  | 	shrink_lruvec(lruvec, &sc); | 
					
						
							| 
									
										
										
										
											2010-08-09 17:19:56 -07:00
										 |  |  | 
 | 
					
						
							|  |  |  | 	trace_mm_vmscan_memcg_softlimit_reclaim_end(sc.nr_reclaimed); | 
					
						
							|  |  |  | 
 | 
					
						
							| 
									
										
										
										
											2011-05-26 16:25:25 -07:00
										 |  |  | 	*nr_scanned = sc.nr_scanned; | 
					
						
							| 
									
										
										
										
											2009-09-23 15:56:39 -07:00
										 |  |  | 	return sc.nr_reclaimed; | 
					
						
							|  |  |  | } | 
					
						
							|  |  |  | 
 | 
					
						
							| 
									
										
										
										
											2012-01-12 17:18:32 -08:00
										 |  |  | unsigned long try_to_free_mem_cgroup_pages(struct mem_cgroup *memcg, | 
					
						
							| 
									
										
										
										
											2009-01-07 18:08:24 -08:00
										 |  |  | 					   gfp_t gfp_mask, | 
					
						
							| 
									
										
										
										
											2011-09-14 16:21:58 -07:00
										 |  |  | 					   bool noswap) | 
					
						
							| 
									
										
										
										
											2008-02-07 00:13:56 -08:00
										 |  |  | { | 
					
						
							| 
									
										
										
										
											2009-09-23 15:56:39 -07:00
										 |  |  | 	struct zonelist *zonelist; | 
					
						
							| 
									
										
										
										
											2010-08-09 17:19:56 -07:00
										 |  |  | 	unsigned long nr_reclaimed; | 
					
						
							| 
									
										
										
										
											2011-05-26 16:25:33 -07:00
										 |  |  | 	int nid; | 
					
						
							| 
									
										
										
										
											2008-02-07 00:13:56 -08:00
										 |  |  | 	struct scan_control sc = { | 
					
						
							|  |  |  | 		.may_writepage = !laptop_mode, | 
					
						
							| 
									
										
										
										
											2009-03-31 15:19:30 -07:00
										 |  |  | 		.may_unmap = 1, | 
					
						
							| 
									
										
										
										
											2009-04-21 12:24:57 -07:00
										 |  |  | 		.may_swap = !noswap, | 
					
						
							| 
									
										
										
										
											2009-12-14 17:59:10 -08:00
										 |  |  | 		.nr_to_reclaim = SWAP_CLUSTER_MAX, | 
					
						
							| 
									
										
										
										
											2008-02-07 00:13:56 -08:00
										 |  |  | 		.order = 0, | 
					
						
							| 
									
										
										
										
											2012-05-29 15:06:57 -07:00
										 |  |  | 		.priority = DEF_PRIORITY, | 
					
						
							| 
									
										
										
										
											2012-01-12 17:18:32 -08:00
										 |  |  | 		.target_mem_cgroup = memcg, | 
					
						
							| 
									
										
										
										
											2009-03-31 15:23:31 -07:00
										 |  |  | 		.nodemask = NULL, /* we don't care the placement */ | 
					
						
							| 
									
										
										
										
											2011-05-24 17:12:26 -07:00
										 |  |  | 		.gfp_mask = (gfp_mask & GFP_RECLAIM_MASK) | | 
					
						
							|  |  |  | 				(GFP_HIGHUSER_MOVABLE & ~GFP_RECLAIM_MASK), | 
					
						
							|  |  |  | 	}; | 
					
						
							|  |  |  | 	struct shrink_control shrink = { | 
					
						
							|  |  |  | 		.gfp_mask = sc.gfp_mask, | 
					
						
							| 
									
										
										
										
											2008-02-07 00:13:56 -08:00
										 |  |  | 	}; | 
					
						
							|  |  |  | 
 | 
					
						
							| 
									
										
										
										
											2011-05-26 16:25:33 -07:00
										 |  |  | 	/*
 | 
					
						
							|  |  |  | 	 * Unlike direct reclaim via alloc_pages(), memcg's reclaim doesn't | 
					
						
							|  |  |  | 	 * take care of from where we get pages. So the node where we start the | 
					
						
							|  |  |  | 	 * scan does not need to be the current node. | 
					
						
							|  |  |  | 	 */ | 
					
						
							| 
									
										
										
										
											2012-01-12 17:18:32 -08:00
										 |  |  | 	nid = mem_cgroup_select_victim_node(memcg); | 
					
						
							| 
									
										
										
										
											2011-05-26 16:25:33 -07:00
										 |  |  | 
 | 
					
						
							|  |  |  | 	zonelist = NODE_DATA(nid)->node_zonelists; | 
					
						
							| 
									
										
										
										
											2010-08-09 17:19:56 -07:00
										 |  |  | 
 | 
					
						
							|  |  |  | 	trace_mm_vmscan_memcg_reclaim_begin(0, | 
					
						
							|  |  |  | 					    sc.may_writepage, | 
					
						
							|  |  |  | 					    sc.gfp_mask); | 
					
						
							|  |  |  | 
 | 
					
						
							| 
									
										
										
										
											2011-05-24 17:12:26 -07:00
										 |  |  | 	nr_reclaimed = do_try_to_free_pages(zonelist, &sc, &shrink); | 
					
						
							| 
									
										
										
										
											2010-08-09 17:19:56 -07:00
										 |  |  | 
 | 
					
						
							|  |  |  | 	trace_mm_vmscan_memcg_reclaim_end(nr_reclaimed); | 
					
						
							|  |  |  | 
 | 
					
						
							|  |  |  | 	return nr_reclaimed; | 
					
						
							| 
									
										
										
										
											2008-02-07 00:13:56 -08:00
										 |  |  | } | 
					
						
							|  |  |  | #endif
 | 
					
						
							|  |  |  | 
 | 
					
						
							| 
									
										
										
										
											2012-05-29 15:06:57 -07:00
										 |  |  | static void age_active_anon(struct zone *zone, struct scan_control *sc) | 
					
						
							| 
									
										
										
										
											2012-01-12 17:17:52 -08:00
										 |  |  | { | 
					
						
							| 
									
										
										
										
											2012-01-12 17:18:06 -08:00
										 |  |  | 	struct mem_cgroup *memcg; | 
					
						
							| 
									
										
										
										
											2012-01-12 17:17:52 -08:00
										 |  |  | 
 | 
					
						
							| 
									
										
										
										
											2012-01-12 17:18:06 -08:00
										 |  |  | 	if (!total_swap_pages) | 
					
						
							|  |  |  | 		return; | 
					
						
							|  |  |  | 
 | 
					
						
							|  |  |  | 	memcg = mem_cgroup_iter(NULL, NULL, NULL); | 
					
						
							|  |  |  | 	do { | 
					
						
							| 
									
										
										
										
											2012-05-29 15:07:00 -07:00
										 |  |  | 		struct lruvec *lruvec = mem_cgroup_zone_lruvec(zone, memcg); | 
					
						
							| 
									
										
										
										
											2012-01-12 17:18:06 -08:00
										 |  |  | 
 | 
					
						
							| 
									
										
										
										
											2012-05-29 15:07:00 -07:00
										 |  |  | 		if (inactive_anon_is_low(lruvec)) | 
					
						
							| 
									
										
										
										
											2012-05-29 15:07:01 -07:00
										 |  |  | 			shrink_active_list(SWAP_CLUSTER_MAX, lruvec, | 
					
						
							| 
									
										
										
										
											2012-05-29 15:06:57 -07:00
										 |  |  | 					   sc, LRU_ACTIVE_ANON); | 
					
						
							| 
									
										
										
										
											2012-01-12 17:18:06 -08:00
										 |  |  | 
 | 
					
						
							|  |  |  | 		memcg = mem_cgroup_iter(NULL, memcg, NULL); | 
					
						
							|  |  |  | 	} while (memcg); | 
					
						
							| 
									
										
										
										
											2012-01-12 17:17:52 -08:00
										 |  |  | } | 
					
						
							|  |  |  | 
 | 
					
						
							| 
									
										
										
										
											2012-11-29 13:54:23 -08:00
										 |  |  | static bool zone_balanced(struct zone *zone, int order, | 
					
						
							|  |  |  | 			  unsigned long balance_gap, int classzone_idx) | 
					
						
							|  |  |  | { | 
					
						
							|  |  |  | 	if (!zone_watermark_ok_safe(zone, order, high_wmark_pages(zone) + | 
					
						
							|  |  |  | 				    balance_gap, classzone_idx, 0)) | 
					
						
							|  |  |  | 		return false; | 
					
						
							|  |  |  | 
 | 
					
						
							| 
									
										
										
										
											2012-12-11 16:00:31 -08:00
										 |  |  | 	if (IS_ENABLED(CONFIG_COMPACTION) && order && | 
					
						
							|  |  |  | 	    !compaction_suitable(zone, order)) | 
					
						
							| 
									
										
										
										
											2012-11-29 13:54:23 -08:00
										 |  |  | 		return false; | 
					
						
							|  |  |  | 
 | 
					
						
							|  |  |  | 	return true; | 
					
						
							|  |  |  | } | 
					
						
							|  |  |  | 
 | 
					
						
							| 
									
										
										
										
											2011-01-13 15:46:21 -08:00
										 |  |  | /*
 | 
					
						
							| 
									
										
										
										
											2012-12-23 15:12:54 +01:00
										 |  |  |  * pgdat_balanced() is used when checking if a node is balanced. | 
					
						
							|  |  |  |  * | 
					
						
							|  |  |  |  * For order-0, all zones must be balanced! | 
					
						
							|  |  |  |  * | 
					
						
							|  |  |  |  * For high-order allocations only zones that meet watermarks and are in a | 
					
						
							|  |  |  |  * zone allowed by the callers classzone_idx are added to balanced_pages. The | 
					
						
							|  |  |  |  * total of balanced pages must be at least 25% of the zones allowed by | 
					
						
							|  |  |  |  * classzone_idx for the node to be considered balanced. Forcing all zones to | 
					
						
							|  |  |  |  * be balanced for high orders can cause excessive reclaim when there are | 
					
						
							|  |  |  |  * imbalanced zones. | 
					
						
							| 
									
										
										
										
											2011-01-13 15:46:21 -08:00
										 |  |  |  * The choice of 25% is due to | 
					
						
							|  |  |  |  *   o a 16M DMA zone that is balanced will not balance a zone on any | 
					
						
							|  |  |  |  *     reasonable sized machine | 
					
						
							|  |  |  |  *   o On all other machines, the top zone must be at least a reasonable | 
					
						
							| 
									
										
										
										
											2011-03-30 22:57:33 -03:00
										 |  |  |  *     percentage of the middle zones. For example, on 32-bit x86, highmem | 
					
						
							| 
									
										
										
										
											2011-01-13 15:46:21 -08:00
										 |  |  |  *     would need to be at least 256M for it to be balance a whole node. | 
					
						
							|  |  |  |  *     Similarly, on x86-64 the Normal zone would need to be at least 1G | 
					
						
							|  |  |  |  *     to balance a node on its own. These seemed like reasonable ratios. | 
					
						
							|  |  |  |  */ | 
					
						
							| 
									
										
										
										
											2012-12-23 15:12:54 +01:00
										 |  |  | static bool pgdat_balanced(pg_data_t *pgdat, int order, int classzone_idx) | 
					
						
							| 
									
										
										
										
											2011-01-13 15:46:21 -08:00
										 |  |  | { | 
					
						
							| 
									
										
										
										
											2013-02-22 16:33:52 -08:00
										 |  |  | 	unsigned long managed_pages = 0; | 
					
						
							| 
									
										
										
										
											2012-12-23 15:12:54 +01:00
										 |  |  | 	unsigned long balanced_pages = 0; | 
					
						
							| 
									
										
										
										
											2011-01-13 15:46:21 -08:00
										 |  |  | 	int i; | 
					
						
							|  |  |  | 
 | 
					
						
							| 
									
										
										
										
											2012-12-23 15:12:54 +01:00
										 |  |  | 	/* Check the watermark levels */ | 
					
						
							|  |  |  | 	for (i = 0; i <= classzone_idx; i++) { | 
					
						
							|  |  |  | 		struct zone *zone = pgdat->node_zones + i; | 
					
						
							| 
									
										
										
										
											2011-01-13 15:46:21 -08:00
										 |  |  | 
 | 
					
						
							| 
									
										
										
										
											2012-12-23 15:12:54 +01:00
										 |  |  | 		if (!populated_zone(zone)) | 
					
						
							|  |  |  | 			continue; | 
					
						
							|  |  |  | 
 | 
					
						
							| 
									
										
										
										
											2013-02-22 16:33:52 -08:00
										 |  |  | 		managed_pages += zone->managed_pages; | 
					
						
							| 
									
										
										
										
											2012-12-23 15:12:54 +01:00
										 |  |  | 
 | 
					
						
							|  |  |  | 		/*
 | 
					
						
							|  |  |  | 		 * A special case here: | 
					
						
							|  |  |  | 		 * | 
					
						
							|  |  |  | 		 * balance_pgdat() skips over all_unreclaimable after | 
					
						
							|  |  |  | 		 * DEF_PRIORITY. Effectively, it considers them balanced so | 
					
						
							|  |  |  | 		 * they must be considered balanced here as well! | 
					
						
							|  |  |  | 		 */ | 
					
						
							| 
									
										
										
										
											2013-09-11 14:22:36 -07:00
										 |  |  | 		if (!zone_reclaimable(zone)) { | 
					
						
							| 
									
										
										
										
											2013-02-22 16:33:52 -08:00
										 |  |  | 			balanced_pages += zone->managed_pages; | 
					
						
							| 
									
										
										
										
											2012-12-23 15:12:54 +01:00
										 |  |  | 			continue; | 
					
						
							|  |  |  | 		} | 
					
						
							|  |  |  | 
 | 
					
						
							|  |  |  | 		if (zone_balanced(zone, order, 0, i)) | 
					
						
							| 
									
										
										
										
											2013-02-22 16:33:52 -08:00
										 |  |  | 			balanced_pages += zone->managed_pages; | 
					
						
							| 
									
										
										
										
											2012-12-23 15:12:54 +01:00
										 |  |  | 		else if (!order) | 
					
						
							|  |  |  | 			return false; | 
					
						
							|  |  |  | 	} | 
					
						
							|  |  |  | 
 | 
					
						
							|  |  |  | 	if (order) | 
					
						
							| 
									
										
										
										
											2013-02-22 16:33:52 -08:00
										 |  |  | 		return balanced_pages >= (managed_pages >> 2); | 
					
						
							| 
									
										
										
										
											2012-12-23 15:12:54 +01:00
										 |  |  | 	else | 
					
						
							|  |  |  | 		return true; | 
					
						
							| 
									
										
										
										
											2011-01-13 15:46:21 -08:00
										 |  |  | } | 
					
						
							|  |  |  | 
 | 
					
						
							| 
									
										
										
										
											2012-07-31 16:44:35 -07:00
										 |  |  | /*
 | 
					
						
							|  |  |  |  * Prepare kswapd for sleeping. This verifies that there are no processes | 
					
						
							|  |  |  |  * waiting in throttle_direct_reclaim() and that watermarks have been met. | 
					
						
							|  |  |  |  * | 
					
						
							|  |  |  |  * Returns true if kswapd is ready to sleep | 
					
						
							|  |  |  |  */ | 
					
						
							|  |  |  | static bool prepare_kswapd_sleep(pg_data_t *pgdat, int order, long remaining, | 
					
						
							| 
									
										
										
										
											2011-01-13 15:46:26 -08:00
										 |  |  | 					int classzone_idx) | 
					
						
							| 
									
										
										
										
											2009-12-14 17:58:53 -08:00
										 |  |  | { | 
					
						
							|  |  |  | 	/* If a direct reclaimer woke kswapd within HZ/10, it's premature */ | 
					
						
							|  |  |  | 	if (remaining) | 
					
						
							| 
									
										
										
										
											2012-07-31 16:44:35 -07:00
										 |  |  | 		return false; | 
					
						
							|  |  |  | 
 | 
					
						
							|  |  |  | 	/*
 | 
					
						
							|  |  |  | 	 * There is a potential race between when kswapd checks its watermarks | 
					
						
							|  |  |  | 	 * and a process gets throttled. There is also a potential race if | 
					
						
							|  |  |  | 	 * processes get throttled, kswapd wakes, a large process exits therby | 
					
						
							|  |  |  | 	 * balancing the zones that causes kswapd to miss a wakeup. If kswapd | 
					
						
							|  |  |  | 	 * is going to sleep, no process should be sleeping on pfmemalloc_wait | 
					
						
							|  |  |  | 	 * so wake them now if necessary. If necessary, processes will wake | 
					
						
							|  |  |  | 	 * kswapd and get throttled again | 
					
						
							|  |  |  | 	 */ | 
					
						
							|  |  |  | 	if (waitqueue_active(&pgdat->pfmemalloc_wait)) { | 
					
						
							|  |  |  | 		wake_up(&pgdat->pfmemalloc_wait); | 
					
						
							|  |  |  | 		return false; | 
					
						
							|  |  |  | 	} | 
					
						
							| 
									
										
										
										
											2009-12-14 17:58:53 -08:00
										 |  |  | 
 | 
					
						
							| 
									
										
										
										
											2012-12-23 15:12:54 +01:00
										 |  |  | 	return pgdat_balanced(pgdat, order, classzone_idx); | 
					
						
							| 
									
										
										
										
											2009-12-14 17:58:53 -08:00
										 |  |  | } | 
					
						
							|  |  |  | 
 | 
					
						
							| 
									
										
											  
											
												mm: vmscan: limit the number of pages kswapd reclaims at each priority
This series does not fix all the current known problems with reclaim but
it addresses one important swapping bug when there is background IO.
Changelog since V3
 - Drop the slab shrink changes in light of Glaubers series and
   discussions highlighted that there were a number of potential
   problems with the patch.					(mel)
 - Rebased to 3.10-rc1
Changelog since V2
 - Preserve ratio properly for proportional scanning		(kamezawa)
Changelog since V1
 - Rename ZONE_DIRTY to ZONE_TAIL_LRU_DIRTY			(andi)
 - Reformat comment in shrink_page_list				(andi)
 - Clarify some comments					(dhillf)
 - Rework how the proportional scanning is preserved
 - Add PageReclaim check before kswapd starts writeback
 - Reset sc.nr_reclaimed on every full zone scan
Kswapd and page reclaim behaviour has been screwy in one way or the
other for a long time.  Very broadly speaking it worked in the far past
because machines were limited in memory so it did not have that many
pages to scan and it stalled congestion_wait() frequently to prevent it
going completely nuts.  In recent times it has behaved very
unsatisfactorily with some of the problems compounded by the removal of
stall logic and the introduction of transparent hugepage support with
high-order reclaims.
There are many variations of bugs that are rooted in this area.  One
example is reports of a large copy operations or backup causing the
machine to grind to a halt or applications pushed to swap.  Sometimes in
low memory situations a large percentage of memory suddenly gets
reclaimed.  In other cases an application starts and kswapd hits 100%
CPU usage for prolonged periods of time and so on.  There is now talk of
introducing features like an extra free kbytes tunable to work around
aspects of the problem instead of trying to deal with it.  It's
compounded by the problem that it can be very workload and machine
specific.
This series aims at addressing some of the worst of these problems
without attempting to fundmentally alter how page reclaim works.
Patches 1-2 limits the number of pages kswapd reclaims while still obeying
	the anon/file proportion of the LRUs it should be scanning.
Patches 3-4 control how and when kswapd raises its scanning priority and
	deletes the scanning restart logic which is tricky to follow.
Patch 5 notes that it is too easy for kswapd to reach priority 0 when
	scanning and then reclaim the world. Down with that sort of thing.
Patch 6 notes that kswapd starts writeback based on scanning priority which
	is not necessarily related to dirty pages. It will have kswapd
	writeback pages if a number of unqueued dirty pages have been
	recently encountered at the tail of the LRU.
Patch 7 notes that sometimes kswapd should stall waiting on IO to complete
	to reduce LRU churn and the likelihood that it'll reclaim young
	clean pages or push applications to swap. It will cause kswapd
	to block on IO if it detects that pages being reclaimed under
	writeback are recycling through the LRU before the IO completes.
Patchies 8-9 are cosmetic but balance_pgdat() is easier to follow after they
	are applied.
This was tested using memcached+memcachetest while some background IO
was in progress as implemented by the parallel IO tests implement in MM
Tests.
memcachetest benchmarks how many operations/second memcached can service
and it is run multiple times.  It starts with no background IO and then
re-runs the test with larger amounts of IO in the background to roughly
simulate a large copy in progress.  The expectation is that the IO
should have little or no impact on memcachetest which is running
entirely in memory.
                                        3.10.0-rc1                  3.10.0-rc1
                                           vanilla            lessdisrupt-v4
Ops memcachetest-0M             22155.00 (  0.00%)          22180.00 (  0.11%)
Ops memcachetest-715M           22720.00 (  0.00%)          22355.00 ( -1.61%)
Ops memcachetest-2385M           3939.00 (  0.00%)          23450.00 (495.33%)
Ops memcachetest-4055M           3628.00 (  0.00%)          24341.00 (570.92%)
Ops io-duration-0M                  0.00 (  0.00%)              0.00 (  0.00%)
Ops io-duration-715M               12.00 (  0.00%)              7.00 ( 41.67%)
Ops io-duration-2385M             118.00 (  0.00%)             21.00 ( 82.20%)
Ops io-duration-4055M             162.00 (  0.00%)             36.00 ( 77.78%)
Ops swaptotal-0M                    0.00 (  0.00%)              0.00 (  0.00%)
Ops swaptotal-715M             140134.00 (  0.00%)             18.00 ( 99.99%)
Ops swaptotal-2385M            392438.00 (  0.00%)              0.00 (  0.00%)
Ops swaptotal-4055M            449037.00 (  0.00%)          27864.00 ( 93.79%)
Ops swapin-0M                       0.00 (  0.00%)              0.00 (  0.00%)
Ops swapin-715M                     0.00 (  0.00%)              0.00 (  0.00%)
Ops swapin-2385M               148031.00 (  0.00%)              0.00 (  0.00%)
Ops swapin-4055M               135109.00 (  0.00%)              0.00 (  0.00%)
Ops minorfaults-0M            1529984.00 (  0.00%)        1530235.00 ( -0.02%)
Ops minorfaults-715M          1794168.00 (  0.00%)        1613750.00 ( 10.06%)
Ops minorfaults-2385M         1739813.00 (  0.00%)        1609396.00 (  7.50%)
Ops minorfaults-4055M         1754460.00 (  0.00%)        1614810.00 (  7.96%)
Ops majorfaults-0M                  0.00 (  0.00%)              0.00 (  0.00%)
Ops majorfaults-715M              185.00 (  0.00%)            180.00 (  2.70%)
Ops majorfaults-2385M           24472.00 (  0.00%)            101.00 ( 99.59%)
Ops majorfaults-4055M           22302.00 (  0.00%)            229.00 ( 98.97%)
Note how the vanilla kernels performance collapses when there is enough
IO taking place in the background.  This drop in performance is part of
what users complain of when they start backups.  Note how the swapin and
major fault figures indicate that processes were being pushed to swap
prematurely.  With the series applied, there is no noticable performance
drop and while there is still some swap activity, it's tiny.
20 iterations of this test were run in total and averaged.  Every 5
iterations, additional IO was generated in the background using dd to
measure how the workload was impacted.  The 0M, 715M, 2385M and 4055M
subblock refer to the amount of IO going on in the background at each
iteration.  So memcachetest-2385M is reporting how many
transactions/second memcachetest recorded on average over 5 iterations
while there was 2385M of IO going on in the ground.  There are six
blocks of information reported here
memcachetest is the transactions/second reported by memcachetest. In
	the vanilla kernel note that performance drops from around
	22K/sec to just under 4K/second when there is 2385M of IO going
	on in the background. This is one type of performance collapse
	users complain about if a large cp or backup starts in the
	background
io-duration refers to how long it takes for the background IO to
	complete. It's showing that with the patched kernel that the IO
	completes faster while not interfering with the memcache
	workload
swaptotal is the total amount of swap traffic. With the patched kernel,
	the total amount of swapping is much reduced although it is
	still not zero.
swapin in this case is an indication as to whether we are swap trashing.
	The closer the swapin/swapout ratio is to 1, the worse the
	trashing is.  Note with the patched kernel that there is no swapin
	activity indicating that all the pages swapped were really inactive
	unused pages.
minorfaults are just minor faults. An increased number of minor faults
	can indicate that page reclaim is unmapping the pages but not
	swapping them out before they are faulted back in. With the
	patched kernel, there is only a small change in minor faults
majorfaults are just major faults in the target workload and a high
	number can indicate that a workload is being prematurely
	swapped. With the patched kernel, major faults are much reduced. As
	there are no swapin's recorded so it's not being swapped. The likely
	explanation is that that libraries or configuration files used by
	the workload during startup get paged out by the background IO.
Overall with the series applied, there is no noticable performance drop
due to background IO and while there is still some swap activity, it's
tiny and the lack of swapins imply that the swapped pages were inactive
and unused.
                            3.10.0-rc1  3.10.0-rc1
                               vanilla lessdisrupt-v4
Page Ins                       1234608      101892
Page Outs                     12446272    11810468
Swap Ins                        283406           0
Swap Outs                       698469       27882
Direct pages scanned                 0      136480
Kswapd pages scanned           6266537     5369364
Kswapd pages reclaimed         1088989      930832
Direct pages reclaimed               0      120901
Kswapd efficiency                  17%         17%
Kswapd velocity               5398.371    4635.115
Direct efficiency                 100%         88%
Direct velocity                  0.000     117.817
Percentage direct scans             0%          2%
Page writes by reclaim         1655843     4009929
Page writes file                957374     3982047
Page writes anon                698469       27882
Page reclaim immediate            5245        1745
Page rescued immediate               0           0
Slabs scanned                    33664       25216
Direct inode steals                  0           0
Kswapd inode steals              19409         778
Kswapd skipped wait                  0           0
THP fault alloc                     35          30
THP collapse alloc                 472         401
THP splits                          27          22
THP fault fallback                   0           0
THP collapse fail                    0           1
Compaction stalls                    0           4
Compaction success                   0           0
Compaction failures                  0           4
Page migrate success                 0           0
Page migrate failure                 0           0
Compaction pages isolated            0           0
Compaction migrate scanned           0           0
Compaction free scanned              0           0
Compaction cost                      0           0
NUMA PTE updates                     0           0
NUMA hint faults                     0           0
NUMA hint local faults               0           0
NUMA pages migrated                  0           0
AutoNUMA cost                        0           0
Unfortunately, note that there is a small amount of direct reclaim due to
kswapd no longer reclaiming the world.  ftrace indicates that the direct
reclaim stalls are mostly harmless with the vast bulk of the stalls
incurred by dd
     23 tclsh-3367
     38 memcachetest-13733
     49 memcachetest-12443
     57 tee-3368
   1541 dd-13826
   1981 dd-12539
A consequence of the direct reclaim for dd is that the processes for the
IO workload may show a higher system CPU usage.  There is also a risk that
kswapd not reclaiming the world may mean that it stays awake balancing
zones, does not stall on the appropriate events and continually scans
pages it cannot reclaim consuming CPU.  This will be visible as continued
high CPU usage but in my own tests I only saw a single spike lasting less
than a second and I did not observe any problems related to reclaim while
running the series on my desktop.
This patch:
The number of pages kswapd can reclaim is bound by the number of pages it
scans which is related to the size of the zone and the scanning priority.
In many cases the priority remains low because it's reset every
SWAP_CLUSTER_MAX reclaimed pages but in the event kswapd scans a large
number of pages it cannot reclaim, it will raise the priority and
potentially discard a large percentage of the zone as sc->nr_to_reclaim is
ULONG_MAX.  The user-visible effect is a reclaim "spike" where a large
percentage of memory is suddenly freed.  It would be bad enough if this
was just unused memory but because of how anon/file pages are balanced it
is possible that applications get pushed to swap unnecessarily.
This patch limits the number of pages kswapd will reclaim to the high
watermark.  Reclaim will still overshoot due to it not being a hard limit
as shrink_lruvec() will ignore the sc.nr_to_reclaim at DEF_PRIORITY but it
prevents kswapd reclaiming the world at higher priorities.  The number of
pages it reclaims is not adjusted for high-order allocations as kswapd
will reclaim excessively if it is to balance zones for high-order
allocations.
Signed-off-by: Mel Gorman <mgorman@suse.de>
Reviewed-by: Rik van Riel <riel@redhat.com>
Reviewed-by: Michal Hocko <mhocko@suse.cz>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Acked-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Jiri Slaby <jslaby@suse.cz>
Cc: Valdis Kletnieks <Valdis.Kletnieks@vt.edu>
Tested-by: Zlatko Calusic <zcalusic@bitsync.net>
Cc: dormando <dormando@rydia.net>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
											
										 
											2013-07-03 15:01:42 -07:00
										 |  |  | /*
 | 
					
						
							|  |  |  |  * kswapd shrinks the zone by the number of pages required to reach | 
					
						
							|  |  |  |  * the high watermark. | 
					
						
							| 
									
										
										
										
											2013-07-03 15:01:45 -07:00
										 |  |  |  * | 
					
						
							|  |  |  |  * Returns true if kswapd scanned at least the requested number of pages to | 
					
						
							| 
									
										
										
										
											2013-07-03 15:01:51 -07:00
										 |  |  |  * reclaim or if the lack of progress was due to pages under writeback. | 
					
						
							|  |  |  |  * This is used to determine if the scanning priority needs to be raised. | 
					
						
							| 
									
										
											  
											
												mm: vmscan: limit the number of pages kswapd reclaims at each priority
This series does not fix all the current known problems with reclaim but
it addresses one important swapping bug when there is background IO.
Changelog since V3
 - Drop the slab shrink changes in light of Glaubers series and
   discussions highlighted that there were a number of potential
   problems with the patch.					(mel)
 - Rebased to 3.10-rc1
Changelog since V2
 - Preserve ratio properly for proportional scanning		(kamezawa)
Changelog since V1
 - Rename ZONE_DIRTY to ZONE_TAIL_LRU_DIRTY			(andi)
 - Reformat comment in shrink_page_list				(andi)
 - Clarify some comments					(dhillf)
 - Rework how the proportional scanning is preserved
 - Add PageReclaim check before kswapd starts writeback
 - Reset sc.nr_reclaimed on every full zone scan
Kswapd and page reclaim behaviour has been screwy in one way or the
other for a long time.  Very broadly speaking it worked in the far past
because machines were limited in memory so it did not have that many
pages to scan and it stalled congestion_wait() frequently to prevent it
going completely nuts.  In recent times it has behaved very
unsatisfactorily with some of the problems compounded by the removal of
stall logic and the introduction of transparent hugepage support with
high-order reclaims.
There are many variations of bugs that are rooted in this area.  One
example is reports of a large copy operations or backup causing the
machine to grind to a halt or applications pushed to swap.  Sometimes in
low memory situations a large percentage of memory suddenly gets
reclaimed.  In other cases an application starts and kswapd hits 100%
CPU usage for prolonged periods of time and so on.  There is now talk of
introducing features like an extra free kbytes tunable to work around
aspects of the problem instead of trying to deal with it.  It's
compounded by the problem that it can be very workload and machine
specific.
This series aims at addressing some of the worst of these problems
without attempting to fundmentally alter how page reclaim works.
Patches 1-2 limits the number of pages kswapd reclaims while still obeying
	the anon/file proportion of the LRUs it should be scanning.
Patches 3-4 control how and when kswapd raises its scanning priority and
	deletes the scanning restart logic which is tricky to follow.
Patch 5 notes that it is too easy for kswapd to reach priority 0 when
	scanning and then reclaim the world. Down with that sort of thing.
Patch 6 notes that kswapd starts writeback based on scanning priority which
	is not necessarily related to dirty pages. It will have kswapd
	writeback pages if a number of unqueued dirty pages have been
	recently encountered at the tail of the LRU.
Patch 7 notes that sometimes kswapd should stall waiting on IO to complete
	to reduce LRU churn and the likelihood that it'll reclaim young
	clean pages or push applications to swap. It will cause kswapd
	to block on IO if it detects that pages being reclaimed under
	writeback are recycling through the LRU before the IO completes.
Patchies 8-9 are cosmetic but balance_pgdat() is easier to follow after they
	are applied.
This was tested using memcached+memcachetest while some background IO
was in progress as implemented by the parallel IO tests implement in MM
Tests.
memcachetest benchmarks how many operations/second memcached can service
and it is run multiple times.  It starts with no background IO and then
re-runs the test with larger amounts of IO in the background to roughly
simulate a large copy in progress.  The expectation is that the IO
should have little or no impact on memcachetest which is running
entirely in memory.
                                        3.10.0-rc1                  3.10.0-rc1
                                           vanilla            lessdisrupt-v4
Ops memcachetest-0M             22155.00 (  0.00%)          22180.00 (  0.11%)
Ops memcachetest-715M           22720.00 (  0.00%)          22355.00 ( -1.61%)
Ops memcachetest-2385M           3939.00 (  0.00%)          23450.00 (495.33%)
Ops memcachetest-4055M           3628.00 (  0.00%)          24341.00 (570.92%)
Ops io-duration-0M                  0.00 (  0.00%)              0.00 (  0.00%)
Ops io-duration-715M               12.00 (  0.00%)              7.00 ( 41.67%)
Ops io-duration-2385M             118.00 (  0.00%)             21.00 ( 82.20%)
Ops io-duration-4055M             162.00 (  0.00%)             36.00 ( 77.78%)
Ops swaptotal-0M                    0.00 (  0.00%)              0.00 (  0.00%)
Ops swaptotal-715M             140134.00 (  0.00%)             18.00 ( 99.99%)
Ops swaptotal-2385M            392438.00 (  0.00%)              0.00 (  0.00%)
Ops swaptotal-4055M            449037.00 (  0.00%)          27864.00 ( 93.79%)
Ops swapin-0M                       0.00 (  0.00%)              0.00 (  0.00%)
Ops swapin-715M                     0.00 (  0.00%)              0.00 (  0.00%)
Ops swapin-2385M               148031.00 (  0.00%)              0.00 (  0.00%)
Ops swapin-4055M               135109.00 (  0.00%)              0.00 (  0.00%)
Ops minorfaults-0M            1529984.00 (  0.00%)        1530235.00 ( -0.02%)
Ops minorfaults-715M          1794168.00 (  0.00%)        1613750.00 ( 10.06%)
Ops minorfaults-2385M         1739813.00 (  0.00%)        1609396.00 (  7.50%)
Ops minorfaults-4055M         1754460.00 (  0.00%)        1614810.00 (  7.96%)
Ops majorfaults-0M                  0.00 (  0.00%)              0.00 (  0.00%)
Ops majorfaults-715M              185.00 (  0.00%)            180.00 (  2.70%)
Ops majorfaults-2385M           24472.00 (  0.00%)            101.00 ( 99.59%)
Ops majorfaults-4055M           22302.00 (  0.00%)            229.00 ( 98.97%)
Note how the vanilla kernels performance collapses when there is enough
IO taking place in the background.  This drop in performance is part of
what users complain of when they start backups.  Note how the swapin and
major fault figures indicate that processes were being pushed to swap
prematurely.  With the series applied, there is no noticable performance
drop and while there is still some swap activity, it's tiny.
20 iterations of this test were run in total and averaged.  Every 5
iterations, additional IO was generated in the background using dd to
measure how the workload was impacted.  The 0M, 715M, 2385M and 4055M
subblock refer to the amount of IO going on in the background at each
iteration.  So memcachetest-2385M is reporting how many
transactions/second memcachetest recorded on average over 5 iterations
while there was 2385M of IO going on in the ground.  There are six
blocks of information reported here
memcachetest is the transactions/second reported by memcachetest. In
	the vanilla kernel note that performance drops from around
	22K/sec to just under 4K/second when there is 2385M of IO going
	on in the background. This is one type of performance collapse
	users complain about if a large cp or backup starts in the
	background
io-duration refers to how long it takes for the background IO to
	complete. It's showing that with the patched kernel that the IO
	completes faster while not interfering with the memcache
	workload
swaptotal is the total amount of swap traffic. With the patched kernel,
	the total amount of swapping is much reduced although it is
	still not zero.
swapin in this case is an indication as to whether we are swap trashing.
	The closer the swapin/swapout ratio is to 1, the worse the
	trashing is.  Note with the patched kernel that there is no swapin
	activity indicating that all the pages swapped were really inactive
	unused pages.
minorfaults are just minor faults. An increased number of minor faults
	can indicate that page reclaim is unmapping the pages but not
	swapping them out before they are faulted back in. With the
	patched kernel, there is only a small change in minor faults
majorfaults are just major faults in the target workload and a high
	number can indicate that a workload is being prematurely
	swapped. With the patched kernel, major faults are much reduced. As
	there are no swapin's recorded so it's not being swapped. The likely
	explanation is that that libraries or configuration files used by
	the workload during startup get paged out by the background IO.
Overall with the series applied, there is no noticable performance drop
due to background IO and while there is still some swap activity, it's
tiny and the lack of swapins imply that the swapped pages were inactive
and unused.
                            3.10.0-rc1  3.10.0-rc1
                               vanilla lessdisrupt-v4
Page Ins                       1234608      101892
Page Outs                     12446272    11810468
Swap Ins                        283406           0
Swap Outs                       698469       27882
Direct pages scanned                 0      136480
Kswapd pages scanned           6266537     5369364
Kswapd pages reclaimed         1088989      930832
Direct pages reclaimed               0      120901
Kswapd efficiency                  17%         17%
Kswapd velocity               5398.371    4635.115
Direct efficiency                 100%         88%
Direct velocity                  0.000     117.817
Percentage direct scans             0%          2%
Page writes by reclaim         1655843     4009929
Page writes file                957374     3982047
Page writes anon                698469       27882
Page reclaim immediate            5245        1745
Page rescued immediate               0           0
Slabs scanned                    33664       25216
Direct inode steals                  0           0
Kswapd inode steals              19409         778
Kswapd skipped wait                  0           0
THP fault alloc                     35          30
THP collapse alloc                 472         401
THP splits                          27          22
THP fault fallback                   0           0
THP collapse fail                    0           1
Compaction stalls                    0           4
Compaction success                   0           0
Compaction failures                  0           4
Page migrate success                 0           0
Page migrate failure                 0           0
Compaction pages isolated            0           0
Compaction migrate scanned           0           0
Compaction free scanned              0           0
Compaction cost                      0           0
NUMA PTE updates                     0           0
NUMA hint faults                     0           0
NUMA hint local faults               0           0
NUMA pages migrated                  0           0
AutoNUMA cost                        0           0
Unfortunately, note that there is a small amount of direct reclaim due to
kswapd no longer reclaiming the world.  ftrace indicates that the direct
reclaim stalls are mostly harmless with the vast bulk of the stalls
incurred by dd
     23 tclsh-3367
     38 memcachetest-13733
     49 memcachetest-12443
     57 tee-3368
   1541 dd-13826
   1981 dd-12539
A consequence of the direct reclaim for dd is that the processes for the
IO workload may show a higher system CPU usage.  There is also a risk that
kswapd not reclaiming the world may mean that it stays awake balancing
zones, does not stall on the appropriate events and continually scans
pages it cannot reclaim consuming CPU.  This will be visible as continued
high CPU usage but in my own tests I only saw a single spike lasting less
than a second and I did not observe any problems related to reclaim while
running the series on my desktop.
This patch:
The number of pages kswapd can reclaim is bound by the number of pages it
scans which is related to the size of the zone and the scanning priority.
In many cases the priority remains low because it's reset every
SWAP_CLUSTER_MAX reclaimed pages but in the event kswapd scans a large
number of pages it cannot reclaim, it will raise the priority and
potentially discard a large percentage of the zone as sc->nr_to_reclaim is
ULONG_MAX.  The user-visible effect is a reclaim "spike" where a large
percentage of memory is suddenly freed.  It would be bad enough if this
was just unused memory but because of how anon/file pages are balanced it
is possible that applications get pushed to swap unnecessarily.
This patch limits the number of pages kswapd will reclaim to the high
watermark.  Reclaim will still overshoot due to it not being a hard limit
as shrink_lruvec() will ignore the sc.nr_to_reclaim at DEF_PRIORITY but it
prevents kswapd reclaiming the world at higher priorities.  The number of
pages it reclaims is not adjusted for high-order allocations as kswapd
will reclaim excessively if it is to balance zones for high-order
allocations.
Signed-off-by: Mel Gorman <mgorman@suse.de>
Reviewed-by: Rik van Riel <riel@redhat.com>
Reviewed-by: Michal Hocko <mhocko@suse.cz>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Acked-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Jiri Slaby <jslaby@suse.cz>
Cc: Valdis Kletnieks <Valdis.Kletnieks@vt.edu>
Tested-by: Zlatko Calusic <zcalusic@bitsync.net>
Cc: dormando <dormando@rydia.net>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
											
										 
											2013-07-03 15:01:42 -07:00
										 |  |  |  */ | 
					
						
							| 
									
										
										
										
											2013-07-03 15:01:45 -07:00
										 |  |  | static bool kswapd_shrink_zone(struct zone *zone, | 
					
						
							| 
									
										
										
										
											2013-07-03 15:01:54 -07:00
										 |  |  | 			       int classzone_idx, | 
					
						
							| 
									
										
											  
											
												mm: vmscan: limit the number of pages kswapd reclaims at each priority
This series does not fix all the current known problems with reclaim but
it addresses one important swapping bug when there is background IO.
Changelog since V3
 - Drop the slab shrink changes in light of Glaubers series and
   discussions highlighted that there were a number of potential
   problems with the patch.					(mel)
 - Rebased to 3.10-rc1
Changelog since V2
 - Preserve ratio properly for proportional scanning		(kamezawa)
Changelog since V1
 - Rename ZONE_DIRTY to ZONE_TAIL_LRU_DIRTY			(andi)
 - Reformat comment in shrink_page_list				(andi)
 - Clarify some comments					(dhillf)
 - Rework how the proportional scanning is preserved
 - Add PageReclaim check before kswapd starts writeback
 - Reset sc.nr_reclaimed on every full zone scan
Kswapd and page reclaim behaviour has been screwy in one way or the
other for a long time.  Very broadly speaking it worked in the far past
because machines were limited in memory so it did not have that many
pages to scan and it stalled congestion_wait() frequently to prevent it
going completely nuts.  In recent times it has behaved very
unsatisfactorily with some of the problems compounded by the removal of
stall logic and the introduction of transparent hugepage support with
high-order reclaims.
There are many variations of bugs that are rooted in this area.  One
example is reports of a large copy operations or backup causing the
machine to grind to a halt or applications pushed to swap.  Sometimes in
low memory situations a large percentage of memory suddenly gets
reclaimed.  In other cases an application starts and kswapd hits 100%
CPU usage for prolonged periods of time and so on.  There is now talk of
introducing features like an extra free kbytes tunable to work around
aspects of the problem instead of trying to deal with it.  It's
compounded by the problem that it can be very workload and machine
specific.
This series aims at addressing some of the worst of these problems
without attempting to fundmentally alter how page reclaim works.
Patches 1-2 limits the number of pages kswapd reclaims while still obeying
	the anon/file proportion of the LRUs it should be scanning.
Patches 3-4 control how and when kswapd raises its scanning priority and
	deletes the scanning restart logic which is tricky to follow.
Patch 5 notes that it is too easy for kswapd to reach priority 0 when
	scanning and then reclaim the world. Down with that sort of thing.
Patch 6 notes that kswapd starts writeback based on scanning priority which
	is not necessarily related to dirty pages. It will have kswapd
	writeback pages if a number of unqueued dirty pages have been
	recently encountered at the tail of the LRU.
Patch 7 notes that sometimes kswapd should stall waiting on IO to complete
	to reduce LRU churn and the likelihood that it'll reclaim young
	clean pages or push applications to swap. It will cause kswapd
	to block on IO if it detects that pages being reclaimed under
	writeback are recycling through the LRU before the IO completes.
Patchies 8-9 are cosmetic but balance_pgdat() is easier to follow after they
	are applied.
This was tested using memcached+memcachetest while some background IO
was in progress as implemented by the parallel IO tests implement in MM
Tests.
memcachetest benchmarks how many operations/second memcached can service
and it is run multiple times.  It starts with no background IO and then
re-runs the test with larger amounts of IO in the background to roughly
simulate a large copy in progress.  The expectation is that the IO
should have little or no impact on memcachetest which is running
entirely in memory.
                                        3.10.0-rc1                  3.10.0-rc1
                                           vanilla            lessdisrupt-v4
Ops memcachetest-0M             22155.00 (  0.00%)          22180.00 (  0.11%)
Ops memcachetest-715M           22720.00 (  0.00%)          22355.00 ( -1.61%)
Ops memcachetest-2385M           3939.00 (  0.00%)          23450.00 (495.33%)
Ops memcachetest-4055M           3628.00 (  0.00%)          24341.00 (570.92%)
Ops io-duration-0M                  0.00 (  0.00%)              0.00 (  0.00%)
Ops io-duration-715M               12.00 (  0.00%)              7.00 ( 41.67%)
Ops io-duration-2385M             118.00 (  0.00%)             21.00 ( 82.20%)
Ops io-duration-4055M             162.00 (  0.00%)             36.00 ( 77.78%)
Ops swaptotal-0M                    0.00 (  0.00%)              0.00 (  0.00%)
Ops swaptotal-715M             140134.00 (  0.00%)             18.00 ( 99.99%)
Ops swaptotal-2385M            392438.00 (  0.00%)              0.00 (  0.00%)
Ops swaptotal-4055M            449037.00 (  0.00%)          27864.00 ( 93.79%)
Ops swapin-0M                       0.00 (  0.00%)              0.00 (  0.00%)
Ops swapin-715M                     0.00 (  0.00%)              0.00 (  0.00%)
Ops swapin-2385M               148031.00 (  0.00%)              0.00 (  0.00%)
Ops swapin-4055M               135109.00 (  0.00%)              0.00 (  0.00%)
Ops minorfaults-0M            1529984.00 (  0.00%)        1530235.00 ( -0.02%)
Ops minorfaults-715M          1794168.00 (  0.00%)        1613750.00 ( 10.06%)
Ops minorfaults-2385M         1739813.00 (  0.00%)        1609396.00 (  7.50%)
Ops minorfaults-4055M         1754460.00 (  0.00%)        1614810.00 (  7.96%)
Ops majorfaults-0M                  0.00 (  0.00%)              0.00 (  0.00%)
Ops majorfaults-715M              185.00 (  0.00%)            180.00 (  2.70%)
Ops majorfaults-2385M           24472.00 (  0.00%)            101.00 ( 99.59%)
Ops majorfaults-4055M           22302.00 (  0.00%)            229.00 ( 98.97%)
Note how the vanilla kernels performance collapses when there is enough
IO taking place in the background.  This drop in performance is part of
what users complain of when they start backups.  Note how the swapin and
major fault figures indicate that processes were being pushed to swap
prematurely.  With the series applied, there is no noticable performance
drop and while there is still some swap activity, it's tiny.
20 iterations of this test were run in total and averaged.  Every 5
iterations, additional IO was generated in the background using dd to
measure how the workload was impacted.  The 0M, 715M, 2385M and 4055M
subblock refer to the amount of IO going on in the background at each
iteration.  So memcachetest-2385M is reporting how many
transactions/second memcachetest recorded on average over 5 iterations
while there was 2385M of IO going on in the ground.  There are six
blocks of information reported here
memcachetest is the transactions/second reported by memcachetest. In
	the vanilla kernel note that performance drops from around
	22K/sec to just under 4K/second when there is 2385M of IO going
	on in the background. This is one type of performance collapse
	users complain about if a large cp or backup starts in the
	background
io-duration refers to how long it takes for the background IO to
	complete. It's showing that with the patched kernel that the IO
	completes faster while not interfering with the memcache
	workload
swaptotal is the total amount of swap traffic. With the patched kernel,
	the total amount of swapping is much reduced although it is
	still not zero.
swapin in this case is an indication as to whether we are swap trashing.
	The closer the swapin/swapout ratio is to 1, the worse the
	trashing is.  Note with the patched kernel that there is no swapin
	activity indicating that all the pages swapped were really inactive
	unused pages.
minorfaults are just minor faults. An increased number of minor faults
	can indicate that page reclaim is unmapping the pages but not
	swapping them out before they are faulted back in. With the
	patched kernel, there is only a small change in minor faults
majorfaults are just major faults in the target workload and a high
	number can indicate that a workload is being prematurely
	swapped. With the patched kernel, major faults are much reduced. As
	there are no swapin's recorded so it's not being swapped. The likely
	explanation is that that libraries or configuration files used by
	the workload during startup get paged out by the background IO.
Overall with the series applied, there is no noticable performance drop
due to background IO and while there is still some swap activity, it's
tiny and the lack of swapins imply that the swapped pages were inactive
and unused.
                            3.10.0-rc1  3.10.0-rc1
                               vanilla lessdisrupt-v4
Page Ins                       1234608      101892
Page Outs                     12446272    11810468
Swap Ins                        283406           0
Swap Outs                       698469       27882
Direct pages scanned                 0      136480
Kswapd pages scanned           6266537     5369364
Kswapd pages reclaimed         1088989      930832
Direct pages reclaimed               0      120901
Kswapd efficiency                  17%         17%
Kswapd velocity               5398.371    4635.115
Direct efficiency                 100%         88%
Direct velocity                  0.000     117.817
Percentage direct scans             0%          2%
Page writes by reclaim         1655843     4009929
Page writes file                957374     3982047
Page writes anon                698469       27882
Page reclaim immediate            5245        1745
Page rescued immediate               0           0
Slabs scanned                    33664       25216
Direct inode steals                  0           0
Kswapd inode steals              19409         778
Kswapd skipped wait                  0           0
THP fault alloc                     35          30
THP collapse alloc                 472         401
THP splits                          27          22
THP fault fallback                   0           0
THP collapse fail                    0           1
Compaction stalls                    0           4
Compaction success                   0           0
Compaction failures                  0           4
Page migrate success                 0           0
Page migrate failure                 0           0
Compaction pages isolated            0           0
Compaction migrate scanned           0           0
Compaction free scanned              0           0
Compaction cost                      0           0
NUMA PTE updates                     0           0
NUMA hint faults                     0           0
NUMA hint local faults               0           0
NUMA pages migrated                  0           0
AutoNUMA cost                        0           0
Unfortunately, note that there is a small amount of direct reclaim due to
kswapd no longer reclaiming the world.  ftrace indicates that the direct
reclaim stalls are mostly harmless with the vast bulk of the stalls
incurred by dd
     23 tclsh-3367
     38 memcachetest-13733
     49 memcachetest-12443
     57 tee-3368
   1541 dd-13826
   1981 dd-12539
A consequence of the direct reclaim for dd is that the processes for the
IO workload may show a higher system CPU usage.  There is also a risk that
kswapd not reclaiming the world may mean that it stays awake balancing
zones, does not stall on the appropriate events and continually scans
pages it cannot reclaim consuming CPU.  This will be visible as continued
high CPU usage but in my own tests I only saw a single spike lasting less
than a second and I did not observe any problems related to reclaim while
running the series on my desktop.
This patch:
The number of pages kswapd can reclaim is bound by the number of pages it
scans which is related to the size of the zone and the scanning priority.
In many cases the priority remains low because it's reset every
SWAP_CLUSTER_MAX reclaimed pages but in the event kswapd scans a large
number of pages it cannot reclaim, it will raise the priority and
potentially discard a large percentage of the zone as sc->nr_to_reclaim is
ULONG_MAX.  The user-visible effect is a reclaim "spike" where a large
percentage of memory is suddenly freed.  It would be bad enough if this
was just unused memory but because of how anon/file pages are balanced it
is possible that applications get pushed to swap unnecessarily.
This patch limits the number of pages kswapd will reclaim to the high
watermark.  Reclaim will still overshoot due to it not being a hard limit
as shrink_lruvec() will ignore the sc.nr_to_reclaim at DEF_PRIORITY but it
prevents kswapd reclaiming the world at higher priorities.  The number of
pages it reclaims is not adjusted for high-order allocations as kswapd
will reclaim excessively if it is to balance zones for high-order
allocations.
Signed-off-by: Mel Gorman <mgorman@suse.de>
Reviewed-by: Rik van Riel <riel@redhat.com>
Reviewed-by: Michal Hocko <mhocko@suse.cz>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Acked-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Jiri Slaby <jslaby@suse.cz>
Cc: Valdis Kletnieks <Valdis.Kletnieks@vt.edu>
Tested-by: Zlatko Calusic <zcalusic@bitsync.net>
Cc: dormando <dormando@rydia.net>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
											
										 
											2013-07-03 15:01:42 -07:00
										 |  |  | 			       struct scan_control *sc, | 
					
						
							| 
									
										
										
										
											2013-07-03 15:01:47 -07:00
										 |  |  | 			       unsigned long lru_pages, | 
					
						
							|  |  |  | 			       unsigned long *nr_attempted) | 
					
						
							| 
									
										
											  
											
												mm: vmscan: limit the number of pages kswapd reclaims at each priority
This series does not fix all the current known problems with reclaim but
it addresses one important swapping bug when there is background IO.
Changelog since V3
 - Drop the slab shrink changes in light of Glaubers series and
   discussions highlighted that there were a number of potential
   problems with the patch.					(mel)
 - Rebased to 3.10-rc1
Changelog since V2
 - Preserve ratio properly for proportional scanning		(kamezawa)
Changelog since V1
 - Rename ZONE_DIRTY to ZONE_TAIL_LRU_DIRTY			(andi)
 - Reformat comment in shrink_page_list				(andi)
 - Clarify some comments					(dhillf)
 - Rework how the proportional scanning is preserved
 - Add PageReclaim check before kswapd starts writeback
 - Reset sc.nr_reclaimed on every full zone scan
Kswapd and page reclaim behaviour has been screwy in one way or the
other for a long time.  Very broadly speaking it worked in the far past
because machines were limited in memory so it did not have that many
pages to scan and it stalled congestion_wait() frequently to prevent it
going completely nuts.  In recent times it has behaved very
unsatisfactorily with some of the problems compounded by the removal of
stall logic and the introduction of transparent hugepage support with
high-order reclaims.
There are many variations of bugs that are rooted in this area.  One
example is reports of a large copy operations or backup causing the
machine to grind to a halt or applications pushed to swap.  Sometimes in
low memory situations a large percentage of memory suddenly gets
reclaimed.  In other cases an application starts and kswapd hits 100%
CPU usage for prolonged periods of time and so on.  There is now talk of
introducing features like an extra free kbytes tunable to work around
aspects of the problem instead of trying to deal with it.  It's
compounded by the problem that it can be very workload and machine
specific.
This series aims at addressing some of the worst of these problems
without attempting to fundmentally alter how page reclaim works.
Patches 1-2 limits the number of pages kswapd reclaims while still obeying
	the anon/file proportion of the LRUs it should be scanning.
Patches 3-4 control how and when kswapd raises its scanning priority and
	deletes the scanning restart logic which is tricky to follow.
Patch 5 notes that it is too easy for kswapd to reach priority 0 when
	scanning and then reclaim the world. Down with that sort of thing.
Patch 6 notes that kswapd starts writeback based on scanning priority which
	is not necessarily related to dirty pages. It will have kswapd
	writeback pages if a number of unqueued dirty pages have been
	recently encountered at the tail of the LRU.
Patch 7 notes that sometimes kswapd should stall waiting on IO to complete
	to reduce LRU churn and the likelihood that it'll reclaim young
	clean pages or push applications to swap. It will cause kswapd
	to block on IO if it detects that pages being reclaimed under
	writeback are recycling through the LRU before the IO completes.
Patchies 8-9 are cosmetic but balance_pgdat() is easier to follow after they
	are applied.
This was tested using memcached+memcachetest while some background IO
was in progress as implemented by the parallel IO tests implement in MM
Tests.
memcachetest benchmarks how many operations/second memcached can service
and it is run multiple times.  It starts with no background IO and then
re-runs the test with larger amounts of IO in the background to roughly
simulate a large copy in progress.  The expectation is that the IO
should have little or no impact on memcachetest which is running
entirely in memory.
                                        3.10.0-rc1                  3.10.0-rc1
                                           vanilla            lessdisrupt-v4
Ops memcachetest-0M             22155.00 (  0.00%)          22180.00 (  0.11%)
Ops memcachetest-715M           22720.00 (  0.00%)          22355.00 ( -1.61%)
Ops memcachetest-2385M           3939.00 (  0.00%)          23450.00 (495.33%)
Ops memcachetest-4055M           3628.00 (  0.00%)          24341.00 (570.92%)
Ops io-duration-0M                  0.00 (  0.00%)              0.00 (  0.00%)
Ops io-duration-715M               12.00 (  0.00%)              7.00 ( 41.67%)
Ops io-duration-2385M             118.00 (  0.00%)             21.00 ( 82.20%)
Ops io-duration-4055M             162.00 (  0.00%)             36.00 ( 77.78%)
Ops swaptotal-0M                    0.00 (  0.00%)              0.00 (  0.00%)
Ops swaptotal-715M             140134.00 (  0.00%)             18.00 ( 99.99%)
Ops swaptotal-2385M            392438.00 (  0.00%)              0.00 (  0.00%)
Ops swaptotal-4055M            449037.00 (  0.00%)          27864.00 ( 93.79%)
Ops swapin-0M                       0.00 (  0.00%)              0.00 (  0.00%)
Ops swapin-715M                     0.00 (  0.00%)              0.00 (  0.00%)
Ops swapin-2385M               148031.00 (  0.00%)              0.00 (  0.00%)
Ops swapin-4055M               135109.00 (  0.00%)              0.00 (  0.00%)
Ops minorfaults-0M            1529984.00 (  0.00%)        1530235.00 ( -0.02%)
Ops minorfaults-715M          1794168.00 (  0.00%)        1613750.00 ( 10.06%)
Ops minorfaults-2385M         1739813.00 (  0.00%)        1609396.00 (  7.50%)
Ops minorfaults-4055M         1754460.00 (  0.00%)        1614810.00 (  7.96%)
Ops majorfaults-0M                  0.00 (  0.00%)              0.00 (  0.00%)
Ops majorfaults-715M              185.00 (  0.00%)            180.00 (  2.70%)
Ops majorfaults-2385M           24472.00 (  0.00%)            101.00 ( 99.59%)
Ops majorfaults-4055M           22302.00 (  0.00%)            229.00 ( 98.97%)
Note how the vanilla kernels performance collapses when there is enough
IO taking place in the background.  This drop in performance is part of
what users complain of when they start backups.  Note how the swapin and
major fault figures indicate that processes were being pushed to swap
prematurely.  With the series applied, there is no noticable performance
drop and while there is still some swap activity, it's tiny.
20 iterations of this test were run in total and averaged.  Every 5
iterations, additional IO was generated in the background using dd to
measure how the workload was impacted.  The 0M, 715M, 2385M and 4055M
subblock refer to the amount of IO going on in the background at each
iteration.  So memcachetest-2385M is reporting how many
transactions/second memcachetest recorded on average over 5 iterations
while there was 2385M of IO going on in the ground.  There are six
blocks of information reported here
memcachetest is the transactions/second reported by memcachetest. In
	the vanilla kernel note that performance drops from around
	22K/sec to just under 4K/second when there is 2385M of IO going
	on in the background. This is one type of performance collapse
	users complain about if a large cp or backup starts in the
	background
io-duration refers to how long it takes for the background IO to
	complete. It's showing that with the patched kernel that the IO
	completes faster while not interfering with the memcache
	workload
swaptotal is the total amount of swap traffic. With the patched kernel,
	the total amount of swapping is much reduced although it is
	still not zero.
swapin in this case is an indication as to whether we are swap trashing.
	The closer the swapin/swapout ratio is to 1, the worse the
	trashing is.  Note with the patched kernel that there is no swapin
	activity indicating that all the pages swapped were really inactive
	unused pages.
minorfaults are just minor faults. An increased number of minor faults
	can indicate that page reclaim is unmapping the pages but not
	swapping them out before they are faulted back in. With the
	patched kernel, there is only a small change in minor faults
majorfaults are just major faults in the target workload and a high
	number can indicate that a workload is being prematurely
	swapped. With the patched kernel, major faults are much reduced. As
	there are no swapin's recorded so it's not being swapped. The likely
	explanation is that that libraries or configuration files used by
	the workload during startup get paged out by the background IO.
Overall with the series applied, there is no noticable performance drop
due to background IO and while there is still some swap activity, it's
tiny and the lack of swapins imply that the swapped pages were inactive
and unused.
                            3.10.0-rc1  3.10.0-rc1
                               vanilla lessdisrupt-v4
Page Ins                       1234608      101892
Page Outs                     12446272    11810468
Swap Ins                        283406           0
Swap Outs                       698469       27882
Direct pages scanned                 0      136480
Kswapd pages scanned           6266537     5369364
Kswapd pages reclaimed         1088989      930832
Direct pages reclaimed               0      120901
Kswapd efficiency                  17%         17%
Kswapd velocity               5398.371    4635.115
Direct efficiency                 100%         88%
Direct velocity                  0.000     117.817
Percentage direct scans             0%          2%
Page writes by reclaim         1655843     4009929
Page writes file                957374     3982047
Page writes anon                698469       27882
Page reclaim immediate            5245        1745
Page rescued immediate               0           0
Slabs scanned                    33664       25216
Direct inode steals                  0           0
Kswapd inode steals              19409         778
Kswapd skipped wait                  0           0
THP fault alloc                     35          30
THP collapse alloc                 472         401
THP splits                          27          22
THP fault fallback                   0           0
THP collapse fail                    0           1
Compaction stalls                    0           4
Compaction success                   0           0
Compaction failures                  0           4
Page migrate success                 0           0
Page migrate failure                 0           0
Compaction pages isolated            0           0
Compaction migrate scanned           0           0
Compaction free scanned              0           0
Compaction cost                      0           0
NUMA PTE updates                     0           0
NUMA hint faults                     0           0
NUMA hint local faults               0           0
NUMA pages migrated                  0           0
AutoNUMA cost                        0           0
Unfortunately, note that there is a small amount of direct reclaim due to
kswapd no longer reclaiming the world.  ftrace indicates that the direct
reclaim stalls are mostly harmless with the vast bulk of the stalls
incurred by dd
     23 tclsh-3367
     38 memcachetest-13733
     49 memcachetest-12443
     57 tee-3368
   1541 dd-13826
   1981 dd-12539
A consequence of the direct reclaim for dd is that the processes for the
IO workload may show a higher system CPU usage.  There is also a risk that
kswapd not reclaiming the world may mean that it stays awake balancing
zones, does not stall on the appropriate events and continually scans
pages it cannot reclaim consuming CPU.  This will be visible as continued
high CPU usage but in my own tests I only saw a single spike lasting less
than a second and I did not observe any problems related to reclaim while
running the series on my desktop.
This patch:
The number of pages kswapd can reclaim is bound by the number of pages it
scans which is related to the size of the zone and the scanning priority.
In many cases the priority remains low because it's reset every
SWAP_CLUSTER_MAX reclaimed pages but in the event kswapd scans a large
number of pages it cannot reclaim, it will raise the priority and
potentially discard a large percentage of the zone as sc->nr_to_reclaim is
ULONG_MAX.  The user-visible effect is a reclaim "spike" where a large
percentage of memory is suddenly freed.  It would be bad enough if this
was just unused memory but because of how anon/file pages are balanced it
is possible that applications get pushed to swap unnecessarily.
This patch limits the number of pages kswapd will reclaim to the high
watermark.  Reclaim will still overshoot due to it not being a hard limit
as shrink_lruvec() will ignore the sc.nr_to_reclaim at DEF_PRIORITY but it
prevents kswapd reclaiming the world at higher priorities.  The number of
pages it reclaims is not adjusted for high-order allocations as kswapd
will reclaim excessively if it is to balance zones for high-order
allocations.
Signed-off-by: Mel Gorman <mgorman@suse.de>
Reviewed-by: Rik van Riel <riel@redhat.com>
Reviewed-by: Michal Hocko <mhocko@suse.cz>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Acked-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Jiri Slaby <jslaby@suse.cz>
Cc: Valdis Kletnieks <Valdis.Kletnieks@vt.edu>
Tested-by: Zlatko Calusic <zcalusic@bitsync.net>
Cc: dormando <dormando@rydia.net>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
											
										 
											2013-07-03 15:01:42 -07:00
										 |  |  | { | 
					
						
							| 
									
										
										
										
											2013-07-03 15:01:54 -07:00
										 |  |  | 	int testorder = sc->order; | 
					
						
							|  |  |  | 	unsigned long balance_gap; | 
					
						
							| 
									
										
											  
											
												mm: vmscan: limit the number of pages kswapd reclaims at each priority
This series does not fix all the current known problems with reclaim but
it addresses one important swapping bug when there is background IO.
Changelog since V3
 - Drop the slab shrink changes in light of Glaubers series and
   discussions highlighted that there were a number of potential
   problems with the patch.					(mel)
 - Rebased to 3.10-rc1
Changelog since V2
 - Preserve ratio properly for proportional scanning		(kamezawa)
Changelog since V1
 - Rename ZONE_DIRTY to ZONE_TAIL_LRU_DIRTY			(andi)
 - Reformat comment in shrink_page_list				(andi)
 - Clarify some comments					(dhillf)
 - Rework how the proportional scanning is preserved
 - Add PageReclaim check before kswapd starts writeback
 - Reset sc.nr_reclaimed on every full zone scan
Kswapd and page reclaim behaviour has been screwy in one way or the
other for a long time.  Very broadly speaking it worked in the far past
because machines were limited in memory so it did not have that many
pages to scan and it stalled congestion_wait() frequently to prevent it
going completely nuts.  In recent times it has behaved very
unsatisfactorily with some of the problems compounded by the removal of
stall logic and the introduction of transparent hugepage support with
high-order reclaims.
There are many variations of bugs that are rooted in this area.  One
example is reports of a large copy operations or backup causing the
machine to grind to a halt or applications pushed to swap.  Sometimes in
low memory situations a large percentage of memory suddenly gets
reclaimed.  In other cases an application starts and kswapd hits 100%
CPU usage for prolonged periods of time and so on.  There is now talk of
introducing features like an extra free kbytes tunable to work around
aspects of the problem instead of trying to deal with it.  It's
compounded by the problem that it can be very workload and machine
specific.
This series aims at addressing some of the worst of these problems
without attempting to fundmentally alter how page reclaim works.
Patches 1-2 limits the number of pages kswapd reclaims while still obeying
	the anon/file proportion of the LRUs it should be scanning.
Patches 3-4 control how and when kswapd raises its scanning priority and
	deletes the scanning restart logic which is tricky to follow.
Patch 5 notes that it is too easy for kswapd to reach priority 0 when
	scanning and then reclaim the world. Down with that sort of thing.
Patch 6 notes that kswapd starts writeback based on scanning priority which
	is not necessarily related to dirty pages. It will have kswapd
	writeback pages if a number of unqueued dirty pages have been
	recently encountered at the tail of the LRU.
Patch 7 notes that sometimes kswapd should stall waiting on IO to complete
	to reduce LRU churn and the likelihood that it'll reclaim young
	clean pages or push applications to swap. It will cause kswapd
	to block on IO if it detects that pages being reclaimed under
	writeback are recycling through the LRU before the IO completes.
Patchies 8-9 are cosmetic but balance_pgdat() is easier to follow after they
	are applied.
This was tested using memcached+memcachetest while some background IO
was in progress as implemented by the parallel IO tests implement in MM
Tests.
memcachetest benchmarks how many operations/second memcached can service
and it is run multiple times.  It starts with no background IO and then
re-runs the test with larger amounts of IO in the background to roughly
simulate a large copy in progress.  The expectation is that the IO
should have little or no impact on memcachetest which is running
entirely in memory.
                                        3.10.0-rc1                  3.10.0-rc1
                                           vanilla            lessdisrupt-v4
Ops memcachetest-0M             22155.00 (  0.00%)          22180.00 (  0.11%)
Ops memcachetest-715M           22720.00 (  0.00%)          22355.00 ( -1.61%)
Ops memcachetest-2385M           3939.00 (  0.00%)          23450.00 (495.33%)
Ops memcachetest-4055M           3628.00 (  0.00%)          24341.00 (570.92%)
Ops io-duration-0M                  0.00 (  0.00%)              0.00 (  0.00%)
Ops io-duration-715M               12.00 (  0.00%)              7.00 ( 41.67%)
Ops io-duration-2385M             118.00 (  0.00%)             21.00 ( 82.20%)
Ops io-duration-4055M             162.00 (  0.00%)             36.00 ( 77.78%)
Ops swaptotal-0M                    0.00 (  0.00%)              0.00 (  0.00%)
Ops swaptotal-715M             140134.00 (  0.00%)             18.00 ( 99.99%)
Ops swaptotal-2385M            392438.00 (  0.00%)              0.00 (  0.00%)
Ops swaptotal-4055M            449037.00 (  0.00%)          27864.00 ( 93.79%)
Ops swapin-0M                       0.00 (  0.00%)              0.00 (  0.00%)
Ops swapin-715M                     0.00 (  0.00%)              0.00 (  0.00%)
Ops swapin-2385M               148031.00 (  0.00%)              0.00 (  0.00%)
Ops swapin-4055M               135109.00 (  0.00%)              0.00 (  0.00%)
Ops minorfaults-0M            1529984.00 (  0.00%)        1530235.00 ( -0.02%)
Ops minorfaults-715M          1794168.00 (  0.00%)        1613750.00 ( 10.06%)
Ops minorfaults-2385M         1739813.00 (  0.00%)        1609396.00 (  7.50%)
Ops minorfaults-4055M         1754460.00 (  0.00%)        1614810.00 (  7.96%)
Ops majorfaults-0M                  0.00 (  0.00%)              0.00 (  0.00%)
Ops majorfaults-715M              185.00 (  0.00%)            180.00 (  2.70%)
Ops majorfaults-2385M           24472.00 (  0.00%)            101.00 ( 99.59%)
Ops majorfaults-4055M           22302.00 (  0.00%)            229.00 ( 98.97%)
Note how the vanilla kernels performance collapses when there is enough
IO taking place in the background.  This drop in performance is part of
what users complain of when they start backups.  Note how the swapin and
major fault figures indicate that processes were being pushed to swap
prematurely.  With the series applied, there is no noticable performance
drop and while there is still some swap activity, it's tiny.
20 iterations of this test were run in total and averaged.  Every 5
iterations, additional IO was generated in the background using dd to
measure how the workload was impacted.  The 0M, 715M, 2385M and 4055M
subblock refer to the amount of IO going on in the background at each
iteration.  So memcachetest-2385M is reporting how many
transactions/second memcachetest recorded on average over 5 iterations
while there was 2385M of IO going on in the ground.  There are six
blocks of information reported here
memcachetest is the transactions/second reported by memcachetest. In
	the vanilla kernel note that performance drops from around
	22K/sec to just under 4K/second when there is 2385M of IO going
	on in the background. This is one type of performance collapse
	users complain about if a large cp or backup starts in the
	background
io-duration refers to how long it takes for the background IO to
	complete. It's showing that with the patched kernel that the IO
	completes faster while not interfering with the memcache
	workload
swaptotal is the total amount of swap traffic. With the patched kernel,
	the total amount of swapping is much reduced although it is
	still not zero.
swapin in this case is an indication as to whether we are swap trashing.
	The closer the swapin/swapout ratio is to 1, the worse the
	trashing is.  Note with the patched kernel that there is no swapin
	activity indicating that all the pages swapped were really inactive
	unused pages.
minorfaults are just minor faults. An increased number of minor faults
	can indicate that page reclaim is unmapping the pages but not
	swapping them out before they are faulted back in. With the
	patched kernel, there is only a small change in minor faults
majorfaults are just major faults in the target workload and a high
	number can indicate that a workload is being prematurely
	swapped. With the patched kernel, major faults are much reduced. As
	there are no swapin's recorded so it's not being swapped. The likely
	explanation is that that libraries or configuration files used by
	the workload during startup get paged out by the background IO.
Overall with the series applied, there is no noticable performance drop
due to background IO and while there is still some swap activity, it's
tiny and the lack of swapins imply that the swapped pages were inactive
and unused.
                            3.10.0-rc1  3.10.0-rc1
                               vanilla lessdisrupt-v4
Page Ins                       1234608      101892
Page Outs                     12446272    11810468
Swap Ins                        283406           0
Swap Outs                       698469       27882
Direct pages scanned                 0      136480
Kswapd pages scanned           6266537     5369364
Kswapd pages reclaimed         1088989      930832
Direct pages reclaimed               0      120901
Kswapd efficiency                  17%         17%
Kswapd velocity               5398.371    4635.115
Direct efficiency                 100%         88%
Direct velocity                  0.000     117.817
Percentage direct scans             0%          2%
Page writes by reclaim         1655843     4009929
Page writes file                957374     3982047
Page writes anon                698469       27882
Page reclaim immediate            5245        1745
Page rescued immediate               0           0
Slabs scanned                    33664       25216
Direct inode steals                  0           0
Kswapd inode steals              19409         778
Kswapd skipped wait                  0           0
THP fault alloc                     35          30
THP collapse alloc                 472         401
THP splits                          27          22
THP fault fallback                   0           0
THP collapse fail                    0           1
Compaction stalls                    0           4
Compaction success                   0           0
Compaction failures                  0           4
Page migrate success                 0           0
Page migrate failure                 0           0
Compaction pages isolated            0           0
Compaction migrate scanned           0           0
Compaction free scanned              0           0
Compaction cost                      0           0
NUMA PTE updates                     0           0
NUMA hint faults                     0           0
NUMA hint local faults               0           0
NUMA pages migrated                  0           0
AutoNUMA cost                        0           0
Unfortunately, note that there is a small amount of direct reclaim due to
kswapd no longer reclaiming the world.  ftrace indicates that the direct
reclaim stalls are mostly harmless with the vast bulk of the stalls
incurred by dd
     23 tclsh-3367
     38 memcachetest-13733
     49 memcachetest-12443
     57 tee-3368
   1541 dd-13826
   1981 dd-12539
A consequence of the direct reclaim for dd is that the processes for the
IO workload may show a higher system CPU usage.  There is also a risk that
kswapd not reclaiming the world may mean that it stays awake balancing
zones, does not stall on the appropriate events and continually scans
pages it cannot reclaim consuming CPU.  This will be visible as continued
high CPU usage but in my own tests I only saw a single spike lasting less
than a second and I did not observe any problems related to reclaim while
running the series on my desktop.
This patch:
The number of pages kswapd can reclaim is bound by the number of pages it
scans which is related to the size of the zone and the scanning priority.
In many cases the priority remains low because it's reset every
SWAP_CLUSTER_MAX reclaimed pages but in the event kswapd scans a large
number of pages it cannot reclaim, it will raise the priority and
potentially discard a large percentage of the zone as sc->nr_to_reclaim is
ULONG_MAX.  The user-visible effect is a reclaim "spike" where a large
percentage of memory is suddenly freed.  It would be bad enough if this
was just unused memory but because of how anon/file pages are balanced it
is possible that applications get pushed to swap unnecessarily.
This patch limits the number of pages kswapd will reclaim to the high
watermark.  Reclaim will still overshoot due to it not being a hard limit
as shrink_lruvec() will ignore the sc.nr_to_reclaim at DEF_PRIORITY but it
prevents kswapd reclaiming the world at higher priorities.  The number of
pages it reclaims is not adjusted for high-order allocations as kswapd
will reclaim excessively if it is to balance zones for high-order
allocations.
Signed-off-by: Mel Gorman <mgorman@suse.de>
Reviewed-by: Rik van Riel <riel@redhat.com>
Reviewed-by: Michal Hocko <mhocko@suse.cz>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Acked-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Jiri Slaby <jslaby@suse.cz>
Cc: Valdis Kletnieks <Valdis.Kletnieks@vt.edu>
Tested-by: Zlatko Calusic <zcalusic@bitsync.net>
Cc: dormando <dormando@rydia.net>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
											
										 
											2013-07-03 15:01:42 -07:00
										 |  |  | 	struct reclaim_state *reclaim_state = current->reclaim_state; | 
					
						
							|  |  |  | 	struct shrink_control shrink = { | 
					
						
							|  |  |  | 		.gfp_mask = sc->gfp_mask, | 
					
						
							|  |  |  | 	}; | 
					
						
							| 
									
										
										
										
											2013-07-03 15:01:54 -07:00
										 |  |  | 	bool lowmem_pressure; | 
					
						
							| 
									
										
											  
											
												mm: vmscan: limit the number of pages kswapd reclaims at each priority
This series does not fix all the current known problems with reclaim but
it addresses one important swapping bug when there is background IO.
Changelog since V3
 - Drop the slab shrink changes in light of Glaubers series and
   discussions highlighted that there were a number of potential
   problems with the patch.					(mel)
 - Rebased to 3.10-rc1
Changelog since V2
 - Preserve ratio properly for proportional scanning		(kamezawa)
Changelog since V1
 - Rename ZONE_DIRTY to ZONE_TAIL_LRU_DIRTY			(andi)
 - Reformat comment in shrink_page_list				(andi)
 - Clarify some comments					(dhillf)
 - Rework how the proportional scanning is preserved
 - Add PageReclaim check before kswapd starts writeback
 - Reset sc.nr_reclaimed on every full zone scan
Kswapd and page reclaim behaviour has been screwy in one way or the
other for a long time.  Very broadly speaking it worked in the far past
because machines were limited in memory so it did not have that many
pages to scan and it stalled congestion_wait() frequently to prevent it
going completely nuts.  In recent times it has behaved very
unsatisfactorily with some of the problems compounded by the removal of
stall logic and the introduction of transparent hugepage support with
high-order reclaims.
There are many variations of bugs that are rooted in this area.  One
example is reports of a large copy operations or backup causing the
machine to grind to a halt or applications pushed to swap.  Sometimes in
low memory situations a large percentage of memory suddenly gets
reclaimed.  In other cases an application starts and kswapd hits 100%
CPU usage for prolonged periods of time and so on.  There is now talk of
introducing features like an extra free kbytes tunable to work around
aspects of the problem instead of trying to deal with it.  It's
compounded by the problem that it can be very workload and machine
specific.
This series aims at addressing some of the worst of these problems
without attempting to fundmentally alter how page reclaim works.
Patches 1-2 limits the number of pages kswapd reclaims while still obeying
	the anon/file proportion of the LRUs it should be scanning.
Patches 3-4 control how and when kswapd raises its scanning priority and
	deletes the scanning restart logic which is tricky to follow.
Patch 5 notes that it is too easy for kswapd to reach priority 0 when
	scanning and then reclaim the world. Down with that sort of thing.
Patch 6 notes that kswapd starts writeback based on scanning priority which
	is not necessarily related to dirty pages. It will have kswapd
	writeback pages if a number of unqueued dirty pages have been
	recently encountered at the tail of the LRU.
Patch 7 notes that sometimes kswapd should stall waiting on IO to complete
	to reduce LRU churn and the likelihood that it'll reclaim young
	clean pages or push applications to swap. It will cause kswapd
	to block on IO if it detects that pages being reclaimed under
	writeback are recycling through the LRU before the IO completes.
Patchies 8-9 are cosmetic but balance_pgdat() is easier to follow after they
	are applied.
This was tested using memcached+memcachetest while some background IO
was in progress as implemented by the parallel IO tests implement in MM
Tests.
memcachetest benchmarks how many operations/second memcached can service
and it is run multiple times.  It starts with no background IO and then
re-runs the test with larger amounts of IO in the background to roughly
simulate a large copy in progress.  The expectation is that the IO
should have little or no impact on memcachetest which is running
entirely in memory.
                                        3.10.0-rc1                  3.10.0-rc1
                                           vanilla            lessdisrupt-v4
Ops memcachetest-0M             22155.00 (  0.00%)          22180.00 (  0.11%)
Ops memcachetest-715M           22720.00 (  0.00%)          22355.00 ( -1.61%)
Ops memcachetest-2385M           3939.00 (  0.00%)          23450.00 (495.33%)
Ops memcachetest-4055M           3628.00 (  0.00%)          24341.00 (570.92%)
Ops io-duration-0M                  0.00 (  0.00%)              0.00 (  0.00%)
Ops io-duration-715M               12.00 (  0.00%)              7.00 ( 41.67%)
Ops io-duration-2385M             118.00 (  0.00%)             21.00 ( 82.20%)
Ops io-duration-4055M             162.00 (  0.00%)             36.00 ( 77.78%)
Ops swaptotal-0M                    0.00 (  0.00%)              0.00 (  0.00%)
Ops swaptotal-715M             140134.00 (  0.00%)             18.00 ( 99.99%)
Ops swaptotal-2385M            392438.00 (  0.00%)              0.00 (  0.00%)
Ops swaptotal-4055M            449037.00 (  0.00%)          27864.00 ( 93.79%)
Ops swapin-0M                       0.00 (  0.00%)              0.00 (  0.00%)
Ops swapin-715M                     0.00 (  0.00%)              0.00 (  0.00%)
Ops swapin-2385M               148031.00 (  0.00%)              0.00 (  0.00%)
Ops swapin-4055M               135109.00 (  0.00%)              0.00 (  0.00%)
Ops minorfaults-0M            1529984.00 (  0.00%)        1530235.00 ( -0.02%)
Ops minorfaults-715M          1794168.00 (  0.00%)        1613750.00 ( 10.06%)
Ops minorfaults-2385M         1739813.00 (  0.00%)        1609396.00 (  7.50%)
Ops minorfaults-4055M         1754460.00 (  0.00%)        1614810.00 (  7.96%)
Ops majorfaults-0M                  0.00 (  0.00%)              0.00 (  0.00%)
Ops majorfaults-715M              185.00 (  0.00%)            180.00 (  2.70%)
Ops majorfaults-2385M           24472.00 (  0.00%)            101.00 ( 99.59%)
Ops majorfaults-4055M           22302.00 (  0.00%)            229.00 ( 98.97%)
Note how the vanilla kernels performance collapses when there is enough
IO taking place in the background.  This drop in performance is part of
what users complain of when they start backups.  Note how the swapin and
major fault figures indicate that processes were being pushed to swap
prematurely.  With the series applied, there is no noticable performance
drop and while there is still some swap activity, it's tiny.
20 iterations of this test were run in total and averaged.  Every 5
iterations, additional IO was generated in the background using dd to
measure how the workload was impacted.  The 0M, 715M, 2385M and 4055M
subblock refer to the amount of IO going on in the background at each
iteration.  So memcachetest-2385M is reporting how many
transactions/second memcachetest recorded on average over 5 iterations
while there was 2385M of IO going on in the ground.  There are six
blocks of information reported here
memcachetest is the transactions/second reported by memcachetest. In
	the vanilla kernel note that performance drops from around
	22K/sec to just under 4K/second when there is 2385M of IO going
	on in the background. This is one type of performance collapse
	users complain about if a large cp or backup starts in the
	background
io-duration refers to how long it takes for the background IO to
	complete. It's showing that with the patched kernel that the IO
	completes faster while not interfering with the memcache
	workload
swaptotal is the total amount of swap traffic. With the patched kernel,
	the total amount of swapping is much reduced although it is
	still not zero.
swapin in this case is an indication as to whether we are swap trashing.
	The closer the swapin/swapout ratio is to 1, the worse the
	trashing is.  Note with the patched kernel that there is no swapin
	activity indicating that all the pages swapped were really inactive
	unused pages.
minorfaults are just minor faults. An increased number of minor faults
	can indicate that page reclaim is unmapping the pages but not
	swapping them out before they are faulted back in. With the
	patched kernel, there is only a small change in minor faults
majorfaults are just major faults in the target workload and a high
	number can indicate that a workload is being prematurely
	swapped. With the patched kernel, major faults are much reduced. As
	there are no swapin's recorded so it's not being swapped. The likely
	explanation is that that libraries or configuration files used by
	the workload during startup get paged out by the background IO.
Overall with the series applied, there is no noticable performance drop
due to background IO and while there is still some swap activity, it's
tiny and the lack of swapins imply that the swapped pages were inactive
and unused.
                            3.10.0-rc1  3.10.0-rc1
                               vanilla lessdisrupt-v4
Page Ins                       1234608      101892
Page Outs                     12446272    11810468
Swap Ins                        283406           0
Swap Outs                       698469       27882
Direct pages scanned                 0      136480
Kswapd pages scanned           6266537     5369364
Kswapd pages reclaimed         1088989      930832
Direct pages reclaimed               0      120901
Kswapd efficiency                  17%         17%
Kswapd velocity               5398.371    4635.115
Direct efficiency                 100%         88%
Direct velocity                  0.000     117.817
Percentage direct scans             0%          2%
Page writes by reclaim         1655843     4009929
Page writes file                957374     3982047
Page writes anon                698469       27882
Page reclaim immediate            5245        1745
Page rescued immediate               0           0
Slabs scanned                    33664       25216
Direct inode steals                  0           0
Kswapd inode steals              19409         778
Kswapd skipped wait                  0           0
THP fault alloc                     35          30
THP collapse alloc                 472         401
THP splits                          27          22
THP fault fallback                   0           0
THP collapse fail                    0           1
Compaction stalls                    0           4
Compaction success                   0           0
Compaction failures                  0           4
Page migrate success                 0           0
Page migrate failure                 0           0
Compaction pages isolated            0           0
Compaction migrate scanned           0           0
Compaction free scanned              0           0
Compaction cost                      0           0
NUMA PTE updates                     0           0
NUMA hint faults                     0           0
NUMA hint local faults               0           0
NUMA pages migrated                  0           0
AutoNUMA cost                        0           0
Unfortunately, note that there is a small amount of direct reclaim due to
kswapd no longer reclaiming the world.  ftrace indicates that the direct
reclaim stalls are mostly harmless with the vast bulk of the stalls
incurred by dd
     23 tclsh-3367
     38 memcachetest-13733
     49 memcachetest-12443
     57 tee-3368
   1541 dd-13826
   1981 dd-12539
A consequence of the direct reclaim for dd is that the processes for the
IO workload may show a higher system CPU usage.  There is also a risk that
kswapd not reclaiming the world may mean that it stays awake balancing
zones, does not stall on the appropriate events and continually scans
pages it cannot reclaim consuming CPU.  This will be visible as continued
high CPU usage but in my own tests I only saw a single spike lasting less
than a second and I did not observe any problems related to reclaim while
running the series on my desktop.
This patch:
The number of pages kswapd can reclaim is bound by the number of pages it
scans which is related to the size of the zone and the scanning priority.
In many cases the priority remains low because it's reset every
SWAP_CLUSTER_MAX reclaimed pages but in the event kswapd scans a large
number of pages it cannot reclaim, it will raise the priority and
potentially discard a large percentage of the zone as sc->nr_to_reclaim is
ULONG_MAX.  The user-visible effect is a reclaim "spike" where a large
percentage of memory is suddenly freed.  It would be bad enough if this
was just unused memory but because of how anon/file pages are balanced it
is possible that applications get pushed to swap unnecessarily.
This patch limits the number of pages kswapd will reclaim to the high
watermark.  Reclaim will still overshoot due to it not being a hard limit
as shrink_lruvec() will ignore the sc.nr_to_reclaim at DEF_PRIORITY but it
prevents kswapd reclaiming the world at higher priorities.  The number of
pages it reclaims is not adjusted for high-order allocations as kswapd
will reclaim excessively if it is to balance zones for high-order
allocations.
Signed-off-by: Mel Gorman <mgorman@suse.de>
Reviewed-by: Rik van Riel <riel@redhat.com>
Reviewed-by: Michal Hocko <mhocko@suse.cz>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Acked-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Jiri Slaby <jslaby@suse.cz>
Cc: Valdis Kletnieks <Valdis.Kletnieks@vt.edu>
Tested-by: Zlatko Calusic <zcalusic@bitsync.net>
Cc: dormando <dormando@rydia.net>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
											
										 
											2013-07-03 15:01:42 -07:00
										 |  |  | 
 | 
					
						
							|  |  |  | 	/* Reclaim above the high watermark. */ | 
					
						
							|  |  |  | 	sc->nr_to_reclaim = max(SWAP_CLUSTER_MAX, high_wmark_pages(zone)); | 
					
						
							| 
									
										
										
										
											2013-07-03 15:01:54 -07:00
										 |  |  | 
 | 
					
						
							|  |  |  | 	/*
 | 
					
						
							|  |  |  | 	 * Kswapd reclaims only single pages with compaction enabled. Trying | 
					
						
							|  |  |  | 	 * too hard to reclaim until contiguous free pages have become | 
					
						
							|  |  |  | 	 * available can hurt performance by evicting too much useful data | 
					
						
							|  |  |  | 	 * from memory. Do not reclaim more than needed for compaction. | 
					
						
							|  |  |  | 	 */ | 
					
						
							|  |  |  | 	if (IS_ENABLED(CONFIG_COMPACTION) && sc->order && | 
					
						
							|  |  |  | 			compaction_suitable(zone, sc->order) != | 
					
						
							|  |  |  | 				COMPACT_SKIPPED) | 
					
						
							|  |  |  | 		testorder = 0; | 
					
						
							|  |  |  | 
 | 
					
						
							|  |  |  | 	/*
 | 
					
						
							|  |  |  | 	 * We put equal pressure on every zone, unless one zone has way too | 
					
						
							|  |  |  | 	 * many pages free already. The "too many pages" is defined as the | 
					
						
							|  |  |  | 	 * high wmark plus a "gap" where the gap is either the low | 
					
						
							|  |  |  | 	 * watermark or 1% of the zone, whichever is smaller. | 
					
						
							|  |  |  | 	 */ | 
					
						
							|  |  |  | 	balance_gap = min(low_wmark_pages(zone), | 
					
						
							|  |  |  | 		(zone->managed_pages + KSWAPD_ZONE_BALANCE_GAP_RATIO-1) / | 
					
						
							|  |  |  | 		KSWAPD_ZONE_BALANCE_GAP_RATIO); | 
					
						
							|  |  |  | 
 | 
					
						
							|  |  |  | 	/*
 | 
					
						
							|  |  |  | 	 * If there is no low memory pressure or the zone is balanced then no | 
					
						
							|  |  |  | 	 * reclaim is necessary | 
					
						
							|  |  |  | 	 */ | 
					
						
							|  |  |  | 	lowmem_pressure = (buffer_heads_over_limit && is_highmem(zone)); | 
					
						
							|  |  |  | 	if (!lowmem_pressure && zone_balanced(zone, testorder, | 
					
						
							|  |  |  | 						balance_gap, classzone_idx)) | 
					
						
							|  |  |  | 		return true; | 
					
						
							|  |  |  | 
 | 
					
						
							| 
									
										
											  
											
												mm: vmscan: limit the number of pages kswapd reclaims at each priority
This series does not fix all the current known problems with reclaim but
it addresses one important swapping bug when there is background IO.
Changelog since V3
 - Drop the slab shrink changes in light of Glaubers series and
   discussions highlighted that there were a number of potential
   problems with the patch.					(mel)
 - Rebased to 3.10-rc1
Changelog since V2
 - Preserve ratio properly for proportional scanning		(kamezawa)
Changelog since V1
 - Rename ZONE_DIRTY to ZONE_TAIL_LRU_DIRTY			(andi)
 - Reformat comment in shrink_page_list				(andi)
 - Clarify some comments					(dhillf)
 - Rework how the proportional scanning is preserved
 - Add PageReclaim check before kswapd starts writeback
 - Reset sc.nr_reclaimed on every full zone scan
Kswapd and page reclaim behaviour has been screwy in one way or the
other for a long time.  Very broadly speaking it worked in the far past
because machines were limited in memory so it did not have that many
pages to scan and it stalled congestion_wait() frequently to prevent it
going completely nuts.  In recent times it has behaved very
unsatisfactorily with some of the problems compounded by the removal of
stall logic and the introduction of transparent hugepage support with
high-order reclaims.
There are many variations of bugs that are rooted in this area.  One
example is reports of a large copy operations or backup causing the
machine to grind to a halt or applications pushed to swap.  Sometimes in
low memory situations a large percentage of memory suddenly gets
reclaimed.  In other cases an application starts and kswapd hits 100%
CPU usage for prolonged periods of time and so on.  There is now talk of
introducing features like an extra free kbytes tunable to work around
aspects of the problem instead of trying to deal with it.  It's
compounded by the problem that it can be very workload and machine
specific.
This series aims at addressing some of the worst of these problems
without attempting to fundmentally alter how page reclaim works.
Patches 1-2 limits the number of pages kswapd reclaims while still obeying
	the anon/file proportion of the LRUs it should be scanning.
Patches 3-4 control how and when kswapd raises its scanning priority and
	deletes the scanning restart logic which is tricky to follow.
Patch 5 notes that it is too easy for kswapd to reach priority 0 when
	scanning and then reclaim the world. Down with that sort of thing.
Patch 6 notes that kswapd starts writeback based on scanning priority which
	is not necessarily related to dirty pages. It will have kswapd
	writeback pages if a number of unqueued dirty pages have been
	recently encountered at the tail of the LRU.
Patch 7 notes that sometimes kswapd should stall waiting on IO to complete
	to reduce LRU churn and the likelihood that it'll reclaim young
	clean pages or push applications to swap. It will cause kswapd
	to block on IO if it detects that pages being reclaimed under
	writeback are recycling through the LRU before the IO completes.
Patchies 8-9 are cosmetic but balance_pgdat() is easier to follow after they
	are applied.
This was tested using memcached+memcachetest while some background IO
was in progress as implemented by the parallel IO tests implement in MM
Tests.
memcachetest benchmarks how many operations/second memcached can service
and it is run multiple times.  It starts with no background IO and then
re-runs the test with larger amounts of IO in the background to roughly
simulate a large copy in progress.  The expectation is that the IO
should have little or no impact on memcachetest which is running
entirely in memory.
                                        3.10.0-rc1                  3.10.0-rc1
                                           vanilla            lessdisrupt-v4
Ops memcachetest-0M             22155.00 (  0.00%)          22180.00 (  0.11%)
Ops memcachetest-715M           22720.00 (  0.00%)          22355.00 ( -1.61%)
Ops memcachetest-2385M           3939.00 (  0.00%)          23450.00 (495.33%)
Ops memcachetest-4055M           3628.00 (  0.00%)          24341.00 (570.92%)
Ops io-duration-0M                  0.00 (  0.00%)              0.00 (  0.00%)
Ops io-duration-715M               12.00 (  0.00%)              7.00 ( 41.67%)
Ops io-duration-2385M             118.00 (  0.00%)             21.00 ( 82.20%)
Ops io-duration-4055M             162.00 (  0.00%)             36.00 ( 77.78%)
Ops swaptotal-0M                    0.00 (  0.00%)              0.00 (  0.00%)
Ops swaptotal-715M             140134.00 (  0.00%)             18.00 ( 99.99%)
Ops swaptotal-2385M            392438.00 (  0.00%)              0.00 (  0.00%)
Ops swaptotal-4055M            449037.00 (  0.00%)          27864.00 ( 93.79%)
Ops swapin-0M                       0.00 (  0.00%)              0.00 (  0.00%)
Ops swapin-715M                     0.00 (  0.00%)              0.00 (  0.00%)
Ops swapin-2385M               148031.00 (  0.00%)              0.00 (  0.00%)
Ops swapin-4055M               135109.00 (  0.00%)              0.00 (  0.00%)
Ops minorfaults-0M            1529984.00 (  0.00%)        1530235.00 ( -0.02%)
Ops minorfaults-715M          1794168.00 (  0.00%)        1613750.00 ( 10.06%)
Ops minorfaults-2385M         1739813.00 (  0.00%)        1609396.00 (  7.50%)
Ops minorfaults-4055M         1754460.00 (  0.00%)        1614810.00 (  7.96%)
Ops majorfaults-0M                  0.00 (  0.00%)              0.00 (  0.00%)
Ops majorfaults-715M              185.00 (  0.00%)            180.00 (  2.70%)
Ops majorfaults-2385M           24472.00 (  0.00%)            101.00 ( 99.59%)
Ops majorfaults-4055M           22302.00 (  0.00%)            229.00 ( 98.97%)
Note how the vanilla kernels performance collapses when there is enough
IO taking place in the background.  This drop in performance is part of
what users complain of when they start backups.  Note how the swapin and
major fault figures indicate that processes were being pushed to swap
prematurely.  With the series applied, there is no noticable performance
drop and while there is still some swap activity, it's tiny.
20 iterations of this test were run in total and averaged.  Every 5
iterations, additional IO was generated in the background using dd to
measure how the workload was impacted.  The 0M, 715M, 2385M and 4055M
subblock refer to the amount of IO going on in the background at each
iteration.  So memcachetest-2385M is reporting how many
transactions/second memcachetest recorded on average over 5 iterations
while there was 2385M of IO going on in the ground.  There are six
blocks of information reported here
memcachetest is the transactions/second reported by memcachetest. In
	the vanilla kernel note that performance drops from around
	22K/sec to just under 4K/second when there is 2385M of IO going
	on in the background. This is one type of performance collapse
	users complain about if a large cp or backup starts in the
	background
io-duration refers to how long it takes for the background IO to
	complete. It's showing that with the patched kernel that the IO
	completes faster while not interfering with the memcache
	workload
swaptotal is the total amount of swap traffic. With the patched kernel,
	the total amount of swapping is much reduced although it is
	still not zero.
swapin in this case is an indication as to whether we are swap trashing.
	The closer the swapin/swapout ratio is to 1, the worse the
	trashing is.  Note with the patched kernel that there is no swapin
	activity indicating that all the pages swapped were really inactive
	unused pages.
minorfaults are just minor faults. An increased number of minor faults
	can indicate that page reclaim is unmapping the pages but not
	swapping them out before they are faulted back in. With the
	patched kernel, there is only a small change in minor faults
majorfaults are just major faults in the target workload and a high
	number can indicate that a workload is being prematurely
	swapped. With the patched kernel, major faults are much reduced. As
	there are no swapin's recorded so it's not being swapped. The likely
	explanation is that that libraries or configuration files used by
	the workload during startup get paged out by the background IO.
Overall with the series applied, there is no noticable performance drop
due to background IO and while there is still some swap activity, it's
tiny and the lack of swapins imply that the swapped pages were inactive
and unused.
                            3.10.0-rc1  3.10.0-rc1
                               vanilla lessdisrupt-v4
Page Ins                       1234608      101892
Page Outs                     12446272    11810468
Swap Ins                        283406           0
Swap Outs                       698469       27882
Direct pages scanned                 0      136480
Kswapd pages scanned           6266537     5369364
Kswapd pages reclaimed         1088989      930832
Direct pages reclaimed               0      120901
Kswapd efficiency                  17%         17%
Kswapd velocity               5398.371    4635.115
Direct efficiency                 100%         88%
Direct velocity                  0.000     117.817
Percentage direct scans             0%          2%
Page writes by reclaim         1655843     4009929
Page writes file                957374     3982047
Page writes anon                698469       27882
Page reclaim immediate            5245        1745
Page rescued immediate               0           0
Slabs scanned                    33664       25216
Direct inode steals                  0           0
Kswapd inode steals              19409         778
Kswapd skipped wait                  0           0
THP fault alloc                     35          30
THP collapse alloc                 472         401
THP splits                          27          22
THP fault fallback                   0           0
THP collapse fail                    0           1
Compaction stalls                    0           4
Compaction success                   0           0
Compaction failures                  0           4
Page migrate success                 0           0
Page migrate failure                 0           0
Compaction pages isolated            0           0
Compaction migrate scanned           0           0
Compaction free scanned              0           0
Compaction cost                      0           0
NUMA PTE updates                     0           0
NUMA hint faults                     0           0
NUMA hint local faults               0           0
NUMA pages migrated                  0           0
AutoNUMA cost                        0           0
Unfortunately, note that there is a small amount of direct reclaim due to
kswapd no longer reclaiming the world.  ftrace indicates that the direct
reclaim stalls are mostly harmless with the vast bulk of the stalls
incurred by dd
     23 tclsh-3367
     38 memcachetest-13733
     49 memcachetest-12443
     57 tee-3368
   1541 dd-13826
   1981 dd-12539
A consequence of the direct reclaim for dd is that the processes for the
IO workload may show a higher system CPU usage.  There is also a risk that
kswapd not reclaiming the world may mean that it stays awake balancing
zones, does not stall on the appropriate events and continually scans
pages it cannot reclaim consuming CPU.  This will be visible as continued
high CPU usage but in my own tests I only saw a single spike lasting less
than a second and I did not observe any problems related to reclaim while
running the series on my desktop.
This patch:
The number of pages kswapd can reclaim is bound by the number of pages it
scans which is related to the size of the zone and the scanning priority.
In many cases the priority remains low because it's reset every
SWAP_CLUSTER_MAX reclaimed pages but in the event kswapd scans a large
number of pages it cannot reclaim, it will raise the priority and
potentially discard a large percentage of the zone as sc->nr_to_reclaim is
ULONG_MAX.  The user-visible effect is a reclaim "spike" where a large
percentage of memory is suddenly freed.  It would be bad enough if this
was just unused memory but because of how anon/file pages are balanced it
is possible that applications get pushed to swap unnecessarily.
This patch limits the number of pages kswapd will reclaim to the high
watermark.  Reclaim will still overshoot due to it not being a hard limit
as shrink_lruvec() will ignore the sc.nr_to_reclaim at DEF_PRIORITY but it
prevents kswapd reclaiming the world at higher priorities.  The number of
pages it reclaims is not adjusted for high-order allocations as kswapd
will reclaim excessively if it is to balance zones for high-order
allocations.
Signed-off-by: Mel Gorman <mgorman@suse.de>
Reviewed-by: Rik van Riel <riel@redhat.com>
Reviewed-by: Michal Hocko <mhocko@suse.cz>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Acked-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Jiri Slaby <jslaby@suse.cz>
Cc: Valdis Kletnieks <Valdis.Kletnieks@vt.edu>
Tested-by: Zlatko Calusic <zcalusic@bitsync.net>
Cc: dormando <dormando@rydia.net>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
											
										 
											2013-07-03 15:01:42 -07:00
										 |  |  | 	shrink_zone(zone, sc); | 
					
						
							| 
									
										
										
										
											2013-08-28 10:18:03 +10:00
										 |  |  | 	nodes_clear(shrink.nodes_to_scan); | 
					
						
							|  |  |  | 	node_set(zone_to_nid(zone), shrink.nodes_to_scan); | 
					
						
							| 
									
										
											  
											
												mm: vmscan: limit the number of pages kswapd reclaims at each priority
This series does not fix all the current known problems with reclaim but
it addresses one important swapping bug when there is background IO.
Changelog since V3
 - Drop the slab shrink changes in light of Glaubers series and
   discussions highlighted that there were a number of potential
   problems with the patch.					(mel)
 - Rebased to 3.10-rc1
Changelog since V2
 - Preserve ratio properly for proportional scanning		(kamezawa)
Changelog since V1
 - Rename ZONE_DIRTY to ZONE_TAIL_LRU_DIRTY			(andi)
 - Reformat comment in shrink_page_list				(andi)
 - Clarify some comments					(dhillf)
 - Rework how the proportional scanning is preserved
 - Add PageReclaim check before kswapd starts writeback
 - Reset sc.nr_reclaimed on every full zone scan
Kswapd and page reclaim behaviour has been screwy in one way or the
other for a long time.  Very broadly speaking it worked in the far past
because machines were limited in memory so it did not have that many
pages to scan and it stalled congestion_wait() frequently to prevent it
going completely nuts.  In recent times it has behaved very
unsatisfactorily with some of the problems compounded by the removal of
stall logic and the introduction of transparent hugepage support with
high-order reclaims.
There are many variations of bugs that are rooted in this area.  One
example is reports of a large copy operations or backup causing the
machine to grind to a halt or applications pushed to swap.  Sometimes in
low memory situations a large percentage of memory suddenly gets
reclaimed.  In other cases an application starts and kswapd hits 100%
CPU usage for prolonged periods of time and so on.  There is now talk of
introducing features like an extra free kbytes tunable to work around
aspects of the problem instead of trying to deal with it.  It's
compounded by the problem that it can be very workload and machine
specific.
This series aims at addressing some of the worst of these problems
without attempting to fundmentally alter how page reclaim works.
Patches 1-2 limits the number of pages kswapd reclaims while still obeying
	the anon/file proportion of the LRUs it should be scanning.
Patches 3-4 control how and when kswapd raises its scanning priority and
	deletes the scanning restart logic which is tricky to follow.
Patch 5 notes that it is too easy for kswapd to reach priority 0 when
	scanning and then reclaim the world. Down with that sort of thing.
Patch 6 notes that kswapd starts writeback based on scanning priority which
	is not necessarily related to dirty pages. It will have kswapd
	writeback pages if a number of unqueued dirty pages have been
	recently encountered at the tail of the LRU.
Patch 7 notes that sometimes kswapd should stall waiting on IO to complete
	to reduce LRU churn and the likelihood that it'll reclaim young
	clean pages or push applications to swap. It will cause kswapd
	to block on IO if it detects that pages being reclaimed under
	writeback are recycling through the LRU before the IO completes.
Patchies 8-9 are cosmetic but balance_pgdat() is easier to follow after they
	are applied.
This was tested using memcached+memcachetest while some background IO
was in progress as implemented by the parallel IO tests implement in MM
Tests.
memcachetest benchmarks how many operations/second memcached can service
and it is run multiple times.  It starts with no background IO and then
re-runs the test with larger amounts of IO in the background to roughly
simulate a large copy in progress.  The expectation is that the IO
should have little or no impact on memcachetest which is running
entirely in memory.
                                        3.10.0-rc1                  3.10.0-rc1
                                           vanilla            lessdisrupt-v4
Ops memcachetest-0M             22155.00 (  0.00%)          22180.00 (  0.11%)
Ops memcachetest-715M           22720.00 (  0.00%)          22355.00 ( -1.61%)
Ops memcachetest-2385M           3939.00 (  0.00%)          23450.00 (495.33%)
Ops memcachetest-4055M           3628.00 (  0.00%)          24341.00 (570.92%)
Ops io-duration-0M                  0.00 (  0.00%)              0.00 (  0.00%)
Ops io-duration-715M               12.00 (  0.00%)              7.00 ( 41.67%)
Ops io-duration-2385M             118.00 (  0.00%)             21.00 ( 82.20%)
Ops io-duration-4055M             162.00 (  0.00%)             36.00 ( 77.78%)
Ops swaptotal-0M                    0.00 (  0.00%)              0.00 (  0.00%)
Ops swaptotal-715M             140134.00 (  0.00%)             18.00 ( 99.99%)
Ops swaptotal-2385M            392438.00 (  0.00%)              0.00 (  0.00%)
Ops swaptotal-4055M            449037.00 (  0.00%)          27864.00 ( 93.79%)
Ops swapin-0M                       0.00 (  0.00%)              0.00 (  0.00%)
Ops swapin-715M                     0.00 (  0.00%)              0.00 (  0.00%)
Ops swapin-2385M               148031.00 (  0.00%)              0.00 (  0.00%)
Ops swapin-4055M               135109.00 (  0.00%)              0.00 (  0.00%)
Ops minorfaults-0M            1529984.00 (  0.00%)        1530235.00 ( -0.02%)
Ops minorfaults-715M          1794168.00 (  0.00%)        1613750.00 ( 10.06%)
Ops minorfaults-2385M         1739813.00 (  0.00%)        1609396.00 (  7.50%)
Ops minorfaults-4055M         1754460.00 (  0.00%)        1614810.00 (  7.96%)
Ops majorfaults-0M                  0.00 (  0.00%)              0.00 (  0.00%)
Ops majorfaults-715M              185.00 (  0.00%)            180.00 (  2.70%)
Ops majorfaults-2385M           24472.00 (  0.00%)            101.00 ( 99.59%)
Ops majorfaults-4055M           22302.00 (  0.00%)            229.00 ( 98.97%)
Note how the vanilla kernels performance collapses when there is enough
IO taking place in the background.  This drop in performance is part of
what users complain of when they start backups.  Note how the swapin and
major fault figures indicate that processes were being pushed to swap
prematurely.  With the series applied, there is no noticable performance
drop and while there is still some swap activity, it's tiny.
20 iterations of this test were run in total and averaged.  Every 5
iterations, additional IO was generated in the background using dd to
measure how the workload was impacted.  The 0M, 715M, 2385M and 4055M
subblock refer to the amount of IO going on in the background at each
iteration.  So memcachetest-2385M is reporting how many
transactions/second memcachetest recorded on average over 5 iterations
while there was 2385M of IO going on in the ground.  There are six
blocks of information reported here
memcachetest is the transactions/second reported by memcachetest. In
	the vanilla kernel note that performance drops from around
	22K/sec to just under 4K/second when there is 2385M of IO going
	on in the background. This is one type of performance collapse
	users complain about if a large cp or backup starts in the
	background
io-duration refers to how long it takes for the background IO to
	complete. It's showing that with the patched kernel that the IO
	completes faster while not interfering with the memcache
	workload
swaptotal is the total amount of swap traffic. With the patched kernel,
	the total amount of swapping is much reduced although it is
	still not zero.
swapin in this case is an indication as to whether we are swap trashing.
	The closer the swapin/swapout ratio is to 1, the worse the
	trashing is.  Note with the patched kernel that there is no swapin
	activity indicating that all the pages swapped were really inactive
	unused pages.
minorfaults are just minor faults. An increased number of minor faults
	can indicate that page reclaim is unmapping the pages but not
	swapping them out before they are faulted back in. With the
	patched kernel, there is only a small change in minor faults
majorfaults are just major faults in the target workload and a high
	number can indicate that a workload is being prematurely
	swapped. With the patched kernel, major faults are much reduced. As
	there are no swapin's recorded so it's not being swapped. The likely
	explanation is that that libraries or configuration files used by
	the workload during startup get paged out by the background IO.
Overall with the series applied, there is no noticable performance drop
due to background IO and while there is still some swap activity, it's
tiny and the lack of swapins imply that the swapped pages were inactive
and unused.
                            3.10.0-rc1  3.10.0-rc1
                               vanilla lessdisrupt-v4
Page Ins                       1234608      101892
Page Outs                     12446272    11810468
Swap Ins                        283406           0
Swap Outs                       698469       27882
Direct pages scanned                 0      136480
Kswapd pages scanned           6266537     5369364
Kswapd pages reclaimed         1088989      930832
Direct pages reclaimed               0      120901
Kswapd efficiency                  17%         17%
Kswapd velocity               5398.371    4635.115
Direct efficiency                 100%         88%
Direct velocity                  0.000     117.817
Percentage direct scans             0%          2%
Page writes by reclaim         1655843     4009929
Page writes file                957374     3982047
Page writes anon                698469       27882
Page reclaim immediate            5245        1745
Page rescued immediate               0           0
Slabs scanned                    33664       25216
Direct inode steals                  0           0
Kswapd inode steals              19409         778
Kswapd skipped wait                  0           0
THP fault alloc                     35          30
THP collapse alloc                 472         401
THP splits                          27          22
THP fault fallback                   0           0
THP collapse fail                    0           1
Compaction stalls                    0           4
Compaction success                   0           0
Compaction failures                  0           4
Page migrate success                 0           0
Page migrate failure                 0           0
Compaction pages isolated            0           0
Compaction migrate scanned           0           0
Compaction free scanned              0           0
Compaction cost                      0           0
NUMA PTE updates                     0           0
NUMA hint faults                     0           0
NUMA hint local faults               0           0
NUMA pages migrated                  0           0
AutoNUMA cost                        0           0
Unfortunately, note that there is a small amount of direct reclaim due to
kswapd no longer reclaiming the world.  ftrace indicates that the direct
reclaim stalls are mostly harmless with the vast bulk of the stalls
incurred by dd
     23 tclsh-3367
     38 memcachetest-13733
     49 memcachetest-12443
     57 tee-3368
   1541 dd-13826
   1981 dd-12539
A consequence of the direct reclaim for dd is that the processes for the
IO workload may show a higher system CPU usage.  There is also a risk that
kswapd not reclaiming the world may mean that it stays awake balancing
zones, does not stall on the appropriate events and continually scans
pages it cannot reclaim consuming CPU.  This will be visible as continued
high CPU usage but in my own tests I only saw a single spike lasting less
than a second and I did not observe any problems related to reclaim while
running the series on my desktop.
This patch:
The number of pages kswapd can reclaim is bound by the number of pages it
scans which is related to the size of the zone and the scanning priority.
In many cases the priority remains low because it's reset every
SWAP_CLUSTER_MAX reclaimed pages but in the event kswapd scans a large
number of pages it cannot reclaim, it will raise the priority and
potentially discard a large percentage of the zone as sc->nr_to_reclaim is
ULONG_MAX.  The user-visible effect is a reclaim "spike" where a large
percentage of memory is suddenly freed.  It would be bad enough if this
was just unused memory but because of how anon/file pages are balanced it
is possible that applications get pushed to swap unnecessarily.
This patch limits the number of pages kswapd will reclaim to the high
watermark.  Reclaim will still overshoot due to it not being a hard limit
as shrink_lruvec() will ignore the sc.nr_to_reclaim at DEF_PRIORITY but it
prevents kswapd reclaiming the world at higher priorities.  The number of
pages it reclaims is not adjusted for high-order allocations as kswapd
will reclaim excessively if it is to balance zones for high-order
allocations.
Signed-off-by: Mel Gorman <mgorman@suse.de>
Reviewed-by: Rik van Riel <riel@redhat.com>
Reviewed-by: Michal Hocko <mhocko@suse.cz>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Acked-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Jiri Slaby <jslaby@suse.cz>
Cc: Valdis Kletnieks <Valdis.Kletnieks@vt.edu>
Tested-by: Zlatko Calusic <zcalusic@bitsync.net>
Cc: dormando <dormando@rydia.net>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
											
										 
											2013-07-03 15:01:42 -07:00
										 |  |  | 
 | 
					
						
							|  |  |  | 	reclaim_state->reclaimed_slab = 0; | 
					
						
							| 
									
										
										
										
											2013-09-11 14:22:36 -07:00
										 |  |  | 	shrink_slab(&shrink, sc->nr_scanned, lru_pages); | 
					
						
							| 
									
										
											  
											
												mm: vmscan: limit the number of pages kswapd reclaims at each priority
This series does not fix all the current known problems with reclaim but
it addresses one important swapping bug when there is background IO.
Changelog since V3
 - Drop the slab shrink changes in light of Glaubers series and
   discussions highlighted that there were a number of potential
   problems with the patch.					(mel)
 - Rebased to 3.10-rc1
Changelog since V2
 - Preserve ratio properly for proportional scanning		(kamezawa)
Changelog since V1
 - Rename ZONE_DIRTY to ZONE_TAIL_LRU_DIRTY			(andi)
 - Reformat comment in shrink_page_list				(andi)
 - Clarify some comments					(dhillf)
 - Rework how the proportional scanning is preserved
 - Add PageReclaim check before kswapd starts writeback
 - Reset sc.nr_reclaimed on every full zone scan
Kswapd and page reclaim behaviour has been screwy in one way or the
other for a long time.  Very broadly speaking it worked in the far past
because machines were limited in memory so it did not have that many
pages to scan and it stalled congestion_wait() frequently to prevent it
going completely nuts.  In recent times it has behaved very
unsatisfactorily with some of the problems compounded by the removal of
stall logic and the introduction of transparent hugepage support with
high-order reclaims.
There are many variations of bugs that are rooted in this area.  One
example is reports of a large copy operations or backup causing the
machine to grind to a halt or applications pushed to swap.  Sometimes in
low memory situations a large percentage of memory suddenly gets
reclaimed.  In other cases an application starts and kswapd hits 100%
CPU usage for prolonged periods of time and so on.  There is now talk of
introducing features like an extra free kbytes tunable to work around
aspects of the problem instead of trying to deal with it.  It's
compounded by the problem that it can be very workload and machine
specific.
This series aims at addressing some of the worst of these problems
without attempting to fundmentally alter how page reclaim works.
Patches 1-2 limits the number of pages kswapd reclaims while still obeying
	the anon/file proportion of the LRUs it should be scanning.
Patches 3-4 control how and when kswapd raises its scanning priority and
	deletes the scanning restart logic which is tricky to follow.
Patch 5 notes that it is too easy for kswapd to reach priority 0 when
	scanning and then reclaim the world. Down with that sort of thing.
Patch 6 notes that kswapd starts writeback based on scanning priority which
	is not necessarily related to dirty pages. It will have kswapd
	writeback pages if a number of unqueued dirty pages have been
	recently encountered at the tail of the LRU.
Patch 7 notes that sometimes kswapd should stall waiting on IO to complete
	to reduce LRU churn and the likelihood that it'll reclaim young
	clean pages or push applications to swap. It will cause kswapd
	to block on IO if it detects that pages being reclaimed under
	writeback are recycling through the LRU before the IO completes.
Patchies 8-9 are cosmetic but balance_pgdat() is easier to follow after they
	are applied.
This was tested using memcached+memcachetest while some background IO
was in progress as implemented by the parallel IO tests implement in MM
Tests.
memcachetest benchmarks how many operations/second memcached can service
and it is run multiple times.  It starts with no background IO and then
re-runs the test with larger amounts of IO in the background to roughly
simulate a large copy in progress.  The expectation is that the IO
should have little or no impact on memcachetest which is running
entirely in memory.
                                        3.10.0-rc1                  3.10.0-rc1
                                           vanilla            lessdisrupt-v4
Ops memcachetest-0M             22155.00 (  0.00%)          22180.00 (  0.11%)
Ops memcachetest-715M           22720.00 (  0.00%)          22355.00 ( -1.61%)
Ops memcachetest-2385M           3939.00 (  0.00%)          23450.00 (495.33%)
Ops memcachetest-4055M           3628.00 (  0.00%)          24341.00 (570.92%)
Ops io-duration-0M                  0.00 (  0.00%)              0.00 (  0.00%)
Ops io-duration-715M               12.00 (  0.00%)              7.00 ( 41.67%)
Ops io-duration-2385M             118.00 (  0.00%)             21.00 ( 82.20%)
Ops io-duration-4055M             162.00 (  0.00%)             36.00 ( 77.78%)
Ops swaptotal-0M                    0.00 (  0.00%)              0.00 (  0.00%)
Ops swaptotal-715M             140134.00 (  0.00%)             18.00 ( 99.99%)
Ops swaptotal-2385M            392438.00 (  0.00%)              0.00 (  0.00%)
Ops swaptotal-4055M            449037.00 (  0.00%)          27864.00 ( 93.79%)
Ops swapin-0M                       0.00 (  0.00%)              0.00 (  0.00%)
Ops swapin-715M                     0.00 (  0.00%)              0.00 (  0.00%)
Ops swapin-2385M               148031.00 (  0.00%)              0.00 (  0.00%)
Ops swapin-4055M               135109.00 (  0.00%)              0.00 (  0.00%)
Ops minorfaults-0M            1529984.00 (  0.00%)        1530235.00 ( -0.02%)
Ops minorfaults-715M          1794168.00 (  0.00%)        1613750.00 ( 10.06%)
Ops minorfaults-2385M         1739813.00 (  0.00%)        1609396.00 (  7.50%)
Ops minorfaults-4055M         1754460.00 (  0.00%)        1614810.00 (  7.96%)
Ops majorfaults-0M                  0.00 (  0.00%)              0.00 (  0.00%)
Ops majorfaults-715M              185.00 (  0.00%)            180.00 (  2.70%)
Ops majorfaults-2385M           24472.00 (  0.00%)            101.00 ( 99.59%)
Ops majorfaults-4055M           22302.00 (  0.00%)            229.00 ( 98.97%)
Note how the vanilla kernels performance collapses when there is enough
IO taking place in the background.  This drop in performance is part of
what users complain of when they start backups.  Note how the swapin and
major fault figures indicate that processes were being pushed to swap
prematurely.  With the series applied, there is no noticable performance
drop and while there is still some swap activity, it's tiny.
20 iterations of this test were run in total and averaged.  Every 5
iterations, additional IO was generated in the background using dd to
measure how the workload was impacted.  The 0M, 715M, 2385M and 4055M
subblock refer to the amount of IO going on in the background at each
iteration.  So memcachetest-2385M is reporting how many
transactions/second memcachetest recorded on average over 5 iterations
while there was 2385M of IO going on in the ground.  There are six
blocks of information reported here
memcachetest is the transactions/second reported by memcachetest. In
	the vanilla kernel note that performance drops from around
	22K/sec to just under 4K/second when there is 2385M of IO going
	on in the background. This is one type of performance collapse
	users complain about if a large cp or backup starts in the
	background
io-duration refers to how long it takes for the background IO to
	complete. It's showing that with the patched kernel that the IO
	completes faster while not interfering with the memcache
	workload
swaptotal is the total amount of swap traffic. With the patched kernel,
	the total amount of swapping is much reduced although it is
	still not zero.
swapin in this case is an indication as to whether we are swap trashing.
	The closer the swapin/swapout ratio is to 1, the worse the
	trashing is.  Note with the patched kernel that there is no swapin
	activity indicating that all the pages swapped were really inactive
	unused pages.
minorfaults are just minor faults. An increased number of minor faults
	can indicate that page reclaim is unmapping the pages but not
	swapping them out before they are faulted back in. With the
	patched kernel, there is only a small change in minor faults
majorfaults are just major faults in the target workload and a high
	number can indicate that a workload is being prematurely
	swapped. With the patched kernel, major faults are much reduced. As
	there are no swapin's recorded so it's not being swapped. The likely
	explanation is that that libraries or configuration files used by
	the workload during startup get paged out by the background IO.
Overall with the series applied, there is no noticable performance drop
due to background IO and while there is still some swap activity, it's
tiny and the lack of swapins imply that the swapped pages were inactive
and unused.
                            3.10.0-rc1  3.10.0-rc1
                               vanilla lessdisrupt-v4
Page Ins                       1234608      101892
Page Outs                     12446272    11810468
Swap Ins                        283406           0
Swap Outs                       698469       27882
Direct pages scanned                 0      136480
Kswapd pages scanned           6266537     5369364
Kswapd pages reclaimed         1088989      930832
Direct pages reclaimed               0      120901
Kswapd efficiency                  17%         17%
Kswapd velocity               5398.371    4635.115
Direct efficiency                 100%         88%
Direct velocity                  0.000     117.817
Percentage direct scans             0%          2%
Page writes by reclaim         1655843     4009929
Page writes file                957374     3982047
Page writes anon                698469       27882
Page reclaim immediate            5245        1745
Page rescued immediate               0           0
Slabs scanned                    33664       25216
Direct inode steals                  0           0
Kswapd inode steals              19409         778
Kswapd skipped wait                  0           0
THP fault alloc                     35          30
THP collapse alloc                 472         401
THP splits                          27          22
THP fault fallback                   0           0
THP collapse fail                    0           1
Compaction stalls                    0           4
Compaction success                   0           0
Compaction failures                  0           4
Page migrate success                 0           0
Page migrate failure                 0           0
Compaction pages isolated            0           0
Compaction migrate scanned           0           0
Compaction free scanned              0           0
Compaction cost                      0           0
NUMA PTE updates                     0           0
NUMA hint faults                     0           0
NUMA hint local faults               0           0
NUMA pages migrated                  0           0
AutoNUMA cost                        0           0
Unfortunately, note that there is a small amount of direct reclaim due to
kswapd no longer reclaiming the world.  ftrace indicates that the direct
reclaim stalls are mostly harmless with the vast bulk of the stalls
incurred by dd
     23 tclsh-3367
     38 memcachetest-13733
     49 memcachetest-12443
     57 tee-3368
   1541 dd-13826
   1981 dd-12539
A consequence of the direct reclaim for dd is that the processes for the
IO workload may show a higher system CPU usage.  There is also a risk that
kswapd not reclaiming the world may mean that it stays awake balancing
zones, does not stall on the appropriate events and continually scans
pages it cannot reclaim consuming CPU.  This will be visible as continued
high CPU usage but in my own tests I only saw a single spike lasting less
than a second and I did not observe any problems related to reclaim while
running the series on my desktop.
This patch:
The number of pages kswapd can reclaim is bound by the number of pages it
scans which is related to the size of the zone and the scanning priority.
In many cases the priority remains low because it's reset every
SWAP_CLUSTER_MAX reclaimed pages but in the event kswapd scans a large
number of pages it cannot reclaim, it will raise the priority and
potentially discard a large percentage of the zone as sc->nr_to_reclaim is
ULONG_MAX.  The user-visible effect is a reclaim "spike" where a large
percentage of memory is suddenly freed.  It would be bad enough if this
was just unused memory but because of how anon/file pages are balanced it
is possible that applications get pushed to swap unnecessarily.
This patch limits the number of pages kswapd will reclaim to the high
watermark.  Reclaim will still overshoot due to it not being a hard limit
as shrink_lruvec() will ignore the sc.nr_to_reclaim at DEF_PRIORITY but it
prevents kswapd reclaiming the world at higher priorities.  The number of
pages it reclaims is not adjusted for high-order allocations as kswapd
will reclaim excessively if it is to balance zones for high-order
allocations.
Signed-off-by: Mel Gorman <mgorman@suse.de>
Reviewed-by: Rik van Riel <riel@redhat.com>
Reviewed-by: Michal Hocko <mhocko@suse.cz>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Acked-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Jiri Slaby <jslaby@suse.cz>
Cc: Valdis Kletnieks <Valdis.Kletnieks@vt.edu>
Tested-by: Zlatko Calusic <zcalusic@bitsync.net>
Cc: dormando <dormando@rydia.net>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
											
										 
											2013-07-03 15:01:42 -07:00
										 |  |  | 	sc->nr_reclaimed += reclaim_state->reclaimed_slab; | 
					
						
							|  |  |  | 
 | 
					
						
							| 
									
										
										
										
											2013-07-03 15:01:47 -07:00
										 |  |  | 	/* Account for the number of pages attempted to reclaim */ | 
					
						
							|  |  |  | 	*nr_attempted += sc->nr_to_reclaim; | 
					
						
							|  |  |  | 
 | 
					
						
							| 
									
										
										
										
											2013-07-03 15:01:51 -07:00
										 |  |  | 	zone_clear_flag(zone, ZONE_WRITEBACK); | 
					
						
							|  |  |  | 
 | 
					
						
							| 
									
										
										
										
											2013-07-03 15:01:54 -07:00
										 |  |  | 	/*
 | 
					
						
							|  |  |  | 	 * If a zone reaches its high watermark, consider it to be no longer | 
					
						
							|  |  |  | 	 * congested. It's possible there are dirty pages backed by congested | 
					
						
							|  |  |  | 	 * BDIs but as pressure is relieved, speculatively avoid congestion | 
					
						
							|  |  |  | 	 * waits. | 
					
						
							|  |  |  | 	 */ | 
					
						
							| 
									
										
										
										
											2013-09-11 14:22:36 -07:00
										 |  |  | 	if (zone_reclaimable(zone) && | 
					
						
							| 
									
										
										
										
											2013-07-03 15:01:54 -07:00
										 |  |  | 	    zone_balanced(zone, testorder, 0, classzone_idx)) { | 
					
						
							|  |  |  | 		zone_clear_flag(zone, ZONE_CONGESTED); | 
					
						
							|  |  |  | 		zone_clear_flag(zone, ZONE_TAIL_LRU_DIRTY); | 
					
						
							|  |  |  | 	} | 
					
						
							|  |  |  | 
 | 
					
						
							| 
									
										
										
										
											2013-07-03 15:01:45 -07:00
										 |  |  | 	return sc->nr_scanned >= sc->nr_to_reclaim; | 
					
						
							| 
									
										
											  
											
												mm: vmscan: limit the number of pages kswapd reclaims at each priority
This series does not fix all the current known problems with reclaim but
it addresses one important swapping bug when there is background IO.
Changelog since V3
 - Drop the slab shrink changes in light of Glaubers series and
   discussions highlighted that there were a number of potential
   problems with the patch.					(mel)
 - Rebased to 3.10-rc1
Changelog since V2
 - Preserve ratio properly for proportional scanning		(kamezawa)
Changelog since V1
 - Rename ZONE_DIRTY to ZONE_TAIL_LRU_DIRTY			(andi)
 - Reformat comment in shrink_page_list				(andi)
 - Clarify some comments					(dhillf)
 - Rework how the proportional scanning is preserved
 - Add PageReclaim check before kswapd starts writeback
 - Reset sc.nr_reclaimed on every full zone scan
Kswapd and page reclaim behaviour has been screwy in one way or the
other for a long time.  Very broadly speaking it worked in the far past
because machines were limited in memory so it did not have that many
pages to scan and it stalled congestion_wait() frequently to prevent it
going completely nuts.  In recent times it has behaved very
unsatisfactorily with some of the problems compounded by the removal of
stall logic and the introduction of transparent hugepage support with
high-order reclaims.
There are many variations of bugs that are rooted in this area.  One
example is reports of a large copy operations or backup causing the
machine to grind to a halt or applications pushed to swap.  Sometimes in
low memory situations a large percentage of memory suddenly gets
reclaimed.  In other cases an application starts and kswapd hits 100%
CPU usage for prolonged periods of time and so on.  There is now talk of
introducing features like an extra free kbytes tunable to work around
aspects of the problem instead of trying to deal with it.  It's
compounded by the problem that it can be very workload and machine
specific.
This series aims at addressing some of the worst of these problems
without attempting to fundmentally alter how page reclaim works.
Patches 1-2 limits the number of pages kswapd reclaims while still obeying
	the anon/file proportion of the LRUs it should be scanning.
Patches 3-4 control how and when kswapd raises its scanning priority and
	deletes the scanning restart logic which is tricky to follow.
Patch 5 notes that it is too easy for kswapd to reach priority 0 when
	scanning and then reclaim the world. Down with that sort of thing.
Patch 6 notes that kswapd starts writeback based on scanning priority which
	is not necessarily related to dirty pages. It will have kswapd
	writeback pages if a number of unqueued dirty pages have been
	recently encountered at the tail of the LRU.
Patch 7 notes that sometimes kswapd should stall waiting on IO to complete
	to reduce LRU churn and the likelihood that it'll reclaim young
	clean pages or push applications to swap. It will cause kswapd
	to block on IO if it detects that pages being reclaimed under
	writeback are recycling through the LRU before the IO completes.
Patchies 8-9 are cosmetic but balance_pgdat() is easier to follow after they
	are applied.
This was tested using memcached+memcachetest while some background IO
was in progress as implemented by the parallel IO tests implement in MM
Tests.
memcachetest benchmarks how many operations/second memcached can service
and it is run multiple times.  It starts with no background IO and then
re-runs the test with larger amounts of IO in the background to roughly
simulate a large copy in progress.  The expectation is that the IO
should have little or no impact on memcachetest which is running
entirely in memory.
                                        3.10.0-rc1                  3.10.0-rc1
                                           vanilla            lessdisrupt-v4
Ops memcachetest-0M             22155.00 (  0.00%)          22180.00 (  0.11%)
Ops memcachetest-715M           22720.00 (  0.00%)          22355.00 ( -1.61%)
Ops memcachetest-2385M           3939.00 (  0.00%)          23450.00 (495.33%)
Ops memcachetest-4055M           3628.00 (  0.00%)          24341.00 (570.92%)
Ops io-duration-0M                  0.00 (  0.00%)              0.00 (  0.00%)
Ops io-duration-715M               12.00 (  0.00%)              7.00 ( 41.67%)
Ops io-duration-2385M             118.00 (  0.00%)             21.00 ( 82.20%)
Ops io-duration-4055M             162.00 (  0.00%)             36.00 ( 77.78%)
Ops swaptotal-0M                    0.00 (  0.00%)              0.00 (  0.00%)
Ops swaptotal-715M             140134.00 (  0.00%)             18.00 ( 99.99%)
Ops swaptotal-2385M            392438.00 (  0.00%)              0.00 (  0.00%)
Ops swaptotal-4055M            449037.00 (  0.00%)          27864.00 ( 93.79%)
Ops swapin-0M                       0.00 (  0.00%)              0.00 (  0.00%)
Ops swapin-715M                     0.00 (  0.00%)              0.00 (  0.00%)
Ops swapin-2385M               148031.00 (  0.00%)              0.00 (  0.00%)
Ops swapin-4055M               135109.00 (  0.00%)              0.00 (  0.00%)
Ops minorfaults-0M            1529984.00 (  0.00%)        1530235.00 ( -0.02%)
Ops minorfaults-715M          1794168.00 (  0.00%)        1613750.00 ( 10.06%)
Ops minorfaults-2385M         1739813.00 (  0.00%)        1609396.00 (  7.50%)
Ops minorfaults-4055M         1754460.00 (  0.00%)        1614810.00 (  7.96%)
Ops majorfaults-0M                  0.00 (  0.00%)              0.00 (  0.00%)
Ops majorfaults-715M              185.00 (  0.00%)            180.00 (  2.70%)
Ops majorfaults-2385M           24472.00 (  0.00%)            101.00 ( 99.59%)
Ops majorfaults-4055M           22302.00 (  0.00%)            229.00 ( 98.97%)
Note how the vanilla kernels performance collapses when there is enough
IO taking place in the background.  This drop in performance is part of
what users complain of when they start backups.  Note how the swapin and
major fault figures indicate that processes were being pushed to swap
prematurely.  With the series applied, there is no noticable performance
drop and while there is still some swap activity, it's tiny.
20 iterations of this test were run in total and averaged.  Every 5
iterations, additional IO was generated in the background using dd to
measure how the workload was impacted.  The 0M, 715M, 2385M and 4055M
subblock refer to the amount of IO going on in the background at each
iteration.  So memcachetest-2385M is reporting how many
transactions/second memcachetest recorded on average over 5 iterations
while there was 2385M of IO going on in the ground.  There are six
blocks of information reported here
memcachetest is the transactions/second reported by memcachetest. In
	the vanilla kernel note that performance drops from around
	22K/sec to just under 4K/second when there is 2385M of IO going
	on in the background. This is one type of performance collapse
	users complain about if a large cp or backup starts in the
	background
io-duration refers to how long it takes for the background IO to
	complete. It's showing that with the patched kernel that the IO
	completes faster while not interfering with the memcache
	workload
swaptotal is the total amount of swap traffic. With the patched kernel,
	the total amount of swapping is much reduced although it is
	still not zero.
swapin in this case is an indication as to whether we are swap trashing.
	The closer the swapin/swapout ratio is to 1, the worse the
	trashing is.  Note with the patched kernel that there is no swapin
	activity indicating that all the pages swapped were really inactive
	unused pages.
minorfaults are just minor faults. An increased number of minor faults
	can indicate that page reclaim is unmapping the pages but not
	swapping them out before they are faulted back in. With the
	patched kernel, there is only a small change in minor faults
majorfaults are just major faults in the target workload and a high
	number can indicate that a workload is being prematurely
	swapped. With the patched kernel, major faults are much reduced. As
	there are no swapin's recorded so it's not being swapped. The likely
	explanation is that that libraries or configuration files used by
	the workload during startup get paged out by the background IO.
Overall with the series applied, there is no noticable performance drop
due to background IO and while there is still some swap activity, it's
tiny and the lack of swapins imply that the swapped pages were inactive
and unused.
                            3.10.0-rc1  3.10.0-rc1
                               vanilla lessdisrupt-v4
Page Ins                       1234608      101892
Page Outs                     12446272    11810468
Swap Ins                        283406           0
Swap Outs                       698469       27882
Direct pages scanned                 0      136480
Kswapd pages scanned           6266537     5369364
Kswapd pages reclaimed         1088989      930832
Direct pages reclaimed               0      120901
Kswapd efficiency                  17%         17%
Kswapd velocity               5398.371    4635.115
Direct efficiency                 100%         88%
Direct velocity                  0.000     117.817
Percentage direct scans             0%          2%
Page writes by reclaim         1655843     4009929
Page writes file                957374     3982047
Page writes anon                698469       27882
Page reclaim immediate            5245        1745
Page rescued immediate               0           0
Slabs scanned                    33664       25216
Direct inode steals                  0           0
Kswapd inode steals              19409         778
Kswapd skipped wait                  0           0
THP fault alloc                     35          30
THP collapse alloc                 472         401
THP splits                          27          22
THP fault fallback                   0           0
THP collapse fail                    0           1
Compaction stalls                    0           4
Compaction success                   0           0
Compaction failures                  0           4
Page migrate success                 0           0
Page migrate failure                 0           0
Compaction pages isolated            0           0
Compaction migrate scanned           0           0
Compaction free scanned              0           0
Compaction cost                      0           0
NUMA PTE updates                     0           0
NUMA hint faults                     0           0
NUMA hint local faults               0           0
NUMA pages migrated                  0           0
AutoNUMA cost                        0           0
Unfortunately, note that there is a small amount of direct reclaim due to
kswapd no longer reclaiming the world.  ftrace indicates that the direct
reclaim stalls are mostly harmless with the vast bulk of the stalls
incurred by dd
     23 tclsh-3367
     38 memcachetest-13733
     49 memcachetest-12443
     57 tee-3368
   1541 dd-13826
   1981 dd-12539
A consequence of the direct reclaim for dd is that the processes for the
IO workload may show a higher system CPU usage.  There is also a risk that
kswapd not reclaiming the world may mean that it stays awake balancing
zones, does not stall on the appropriate events and continually scans
pages it cannot reclaim consuming CPU.  This will be visible as continued
high CPU usage but in my own tests I only saw a single spike lasting less
than a second and I did not observe any problems related to reclaim while
running the series on my desktop.
This patch:
The number of pages kswapd can reclaim is bound by the number of pages it
scans which is related to the size of the zone and the scanning priority.
In many cases the priority remains low because it's reset every
SWAP_CLUSTER_MAX reclaimed pages but in the event kswapd scans a large
number of pages it cannot reclaim, it will raise the priority and
potentially discard a large percentage of the zone as sc->nr_to_reclaim is
ULONG_MAX.  The user-visible effect is a reclaim "spike" where a large
percentage of memory is suddenly freed.  It would be bad enough if this
was just unused memory but because of how anon/file pages are balanced it
is possible that applications get pushed to swap unnecessarily.
This patch limits the number of pages kswapd will reclaim to the high
watermark.  Reclaim will still overshoot due to it not being a hard limit
as shrink_lruvec() will ignore the sc.nr_to_reclaim at DEF_PRIORITY but it
prevents kswapd reclaiming the world at higher priorities.  The number of
pages it reclaims is not adjusted for high-order allocations as kswapd
will reclaim excessively if it is to balance zones for high-order
allocations.
Signed-off-by: Mel Gorman <mgorman@suse.de>
Reviewed-by: Rik van Riel <riel@redhat.com>
Reviewed-by: Michal Hocko <mhocko@suse.cz>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Acked-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Jiri Slaby <jslaby@suse.cz>
Cc: Valdis Kletnieks <Valdis.Kletnieks@vt.edu>
Tested-by: Zlatko Calusic <zcalusic@bitsync.net>
Cc: dormando <dormando@rydia.net>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
											
										 
											2013-07-03 15:01:42 -07:00
										 |  |  | } | 
					
						
							|  |  |  | 
 | 
					
						
							| 
									
										
										
										
											2005-04-16 15:20:36 -07:00
										 |  |  | /*
 | 
					
						
							|  |  |  |  * For kswapd, balance_pgdat() will work across all this node's zones until | 
					
						
							| 
									
										
										
										
											2009-06-16 15:32:12 -07:00
										 |  |  |  * they are all at high_wmark_pages(zone). | 
					
						
							| 
									
										
										
										
											2005-04-16 15:20:36 -07:00
										 |  |  |  * | 
					
						
							| 
									
										
										
										
											2011-01-13 15:46:22 -08:00
										 |  |  |  * Returns the final order kswapd was reclaiming at | 
					
						
							| 
									
										
										
										
											2005-04-16 15:20:36 -07:00
										 |  |  |  * | 
					
						
							|  |  |  |  * There is special handling here for zones which are full of pinned pages. | 
					
						
							|  |  |  |  * This can happen if the pages are all mlocked, or if they are all used by | 
					
						
							|  |  |  |  * device drivers (say, ZONE_DMA).  Or if they are all in use by hugetlb. | 
					
						
							|  |  |  |  * What we do is to detect the case where all pages in the zone have been | 
					
						
							|  |  |  |  * scanned twice and there has been zero successful reclaim.  Mark the zone as | 
					
						
							|  |  |  |  * dead and from now on, only perform a short scan.  Basically we're polling | 
					
						
							|  |  |  |  * the zone for when the problem goes away. | 
					
						
							|  |  |  |  * | 
					
						
							|  |  |  |  * kswapd scans the zones in the highmem->normal->dma direction.  It skips | 
					
						
							| 
									
										
										
										
											2009-06-16 15:32:12 -07:00
										 |  |  |  * zones which have free_pages > high_wmark_pages(zone), but once a zone is | 
					
						
							|  |  |  |  * found to have free_pages <= high_wmark_pages(zone), we scan that zone and the | 
					
						
							|  |  |  |  * lower zones regardless of the number of free pages in the lower zones. This | 
					
						
							|  |  |  |  * interoperates with the page allocator fallback scheme to ensure that aging | 
					
						
							|  |  |  |  * of pages is balanced across the zones. | 
					
						
							| 
									
										
										
										
											2005-04-16 15:20:36 -07:00
										 |  |  |  */ | 
					
						
							| 
									
										
											  
											
												mm: kswapd: stop high-order balancing when any suitable zone is balanced
Simon Kirby reported the following problem
   We're seeing cases on a number of servers where cache never fully
   grows to use all available memory.  Sometimes we see servers with 4 GB
   of memory that never seem to have less than 1.5 GB free, even with a
   constantly-active VM.  In some cases, these servers also swap out while
   this happens, even though they are constantly reading the working set
   into memory.  We have been seeing this happening for a long time; I
   don't think it's anything recent, and it still happens on 2.6.36.
After some debugging work by Simon, Dave Hansen and others, the prevaling
theory became that kswapd is reclaiming order-3 pages requested by SLUB
too aggressive about it.
There are two apparent problems here.  On the target machine, there is a
small Normal zone in comparison to DMA32.  As kswapd tries to balance all
zones, it would continually try reclaiming for Normal even though DMA32
was balanced enough for callers.  The second problem is that
sleeping_prematurely() does not use the same logic as balance_pgdat() when
deciding whether to sleep or not.  This keeps kswapd artifically awake.
A number of tests were run and the figures from previous postings will
look very different for a few reasons.  One, the old figures were forcing
my network card to use GFP_ATOMIC in attempt to replicate Simon's problem.
 Second, I previous specified slub_min_order=3 again in an attempt to
reproduce Simon's problem.  In this posting, I'm depending on Simon to say
whether his problem is fixed or not and these figures are to show the
impact to the ordinary cases.  Finally, the "vmscan" figures are taken
from /proc/vmstat instead of the tracepoints.  There is less information
but recording is less disruptive.
The first test of relevance was postmark with a process running in the
background reading a large amount of anonymous memory in blocks.  The
objective was to vaguely simulate what was happening on Simon's machine
and it's memory intensive enough to have kswapd awake.
POSTMARK
                                            traceonly          kanyzone
Transactions per second:              156.00 ( 0.00%)   153.00 (-1.96%)
Data megabytes read per second:        21.51 ( 0.00%)    21.52 ( 0.05%)
Data megabytes written per second:     29.28 ( 0.00%)    29.11 (-0.58%)
Files created alone per second:       250.00 ( 0.00%)   416.00 (39.90%)
Files create/transact per second:      79.00 ( 0.00%)    76.00 (-3.95%)
Files deleted alone per second:       520.00 ( 0.00%)   420.00 (-23.81%)
Files delete/transact per second:      79.00 ( 0.00%)    76.00 (-3.95%)
MMTests Statistics: duration
User/Sys Time Running Test (seconds)         16.58      17.4
Total Elapsed Time (seconds)                218.48    222.47
VMstat Reclaim Statistics: vmscan
Direct reclaims                                  0          4
Direct reclaim pages scanned                     0        203
Direct reclaim pages reclaimed                   0        184
Kswapd pages scanned                        326631     322018
Kswapd pages reclaimed                      312632     309784
Kswapd low wmark quickly                         1          4
Kswapd high wmark quickly                      122        475
Kswapd skip congestion_wait                      1          0
Pages activated                             700040     705317
Pages deactivated                           212113     203922
Pages written                                 9875       6363
Total pages scanned                         326631    322221
Total pages reclaimed                       312632    309968
%age total pages scanned/reclaimed          95.71%    96.20%
%age total pages scanned/written             3.02%     1.97%
proc vmstat: Faults
Major Faults                                   300       254
Minor Faults                                645183    660284
Page ins                                    493588    486704
Page outs                                  4960088   4986704
Swap ins                                      1230       661
Swap outs                                     9869      6355
Performance is mildly affected because kswapd is no longer doing as much
work and the background memory consumer process is getting in the way.
Note that kswapd scanned and reclaimed fewer pages as it's less aggressive
and overall fewer pages were scanned and reclaimed.  Swap in/out is
particularly reduced again reflecting kswapd throwing out fewer pages.
The slight performance impact is unfortunate here but it looks like a
direct result of kswapd being less aggressive.  As the bug report is about
too many pages being freed by kswapd, it may have to be accepted for now.
The second test is a streaming IO benchmark that was previously used by
Johannes to show regressions in page reclaim.
MICRO
					 traceonly  kanyzone
User/Sys Time Running Test (seconds)         29.29     28.87
Total Elapsed Time (seconds)                492.18    488.79
VMstat Reclaim Statistics: vmscan
Direct reclaims                               2128       1460
Direct reclaim pages scanned               2284822    1496067
Direct reclaim pages reclaimed              148919     110937
Kswapd pages scanned                      15450014   16202876
Kswapd pages reclaimed                     8503697    8537897
Kswapd low wmark quickly                      3100       3397
Kswapd high wmark quickly                     1860       7243
Kswapd skip congestion_wait                    708        801
Pages activated                               9635       9573
Pages deactivated                             1432       1271
Pages written                                  223       1130
Total pages scanned                       17734836  17698943
Total pages reclaimed                      8652616   8648834
%age total pages scanned/reclaimed          48.79%    48.87%
%age total pages scanned/written             0.00%     0.01%
proc vmstat: Faults
Major Faults                                   165       221
Minor Faults                               9655785   9656506
Page ins                                      3880      7228
Page outs                                 37692940  37480076
Swap ins                                         0        69
Swap outs                                       19        15
Again fewer pages are scanned and reclaimed as expected and this time the
test completed faster.  Note that kswapd is hitting its watermarks faster
(low and high wmark quickly) which I expect is due to kswapd reclaiming
fewer pages.
I also ran fs-mark, iozone and sysbench but there is nothing interesting
to report in the figures.  Performance is not significantly changed and
the reclaim statistics look reasonable.
Tgis patch:
When the allocator enters its slow path, kswapd is woken up to balance the
node.  It continues working until all zones within the node are balanced.
For order-0 allocations, this makes perfect sense but for higher orders it
can have unintended side-effects.  If the zone sizes are imbalanced,
kswapd may reclaim heavily within a smaller zone discarding an excessive
number of pages.  The user-visible behaviour is that kswapd is awake and
reclaiming even though plenty of pages are free from a suitable zone.
This patch alters the "balance" logic for high-order reclaim allowing
kswapd to stop if any suitable zone becomes balanced to reduce the number
of pages it reclaims from other zones.  kswapd still tries to ensure that
order-0 watermarks for all zones are met before sleeping.
Signed-off-by: Mel Gorman <mel@csn.ul.ie>
Reviewed-by: Minchan Kim <minchan.kim@gmail.com>
Reviewed-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Reviewed-by: Eric B Munson <emunson@mgebm.net>
Cc: Simon Kirby <sim@hostway.ca>
Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Cc: Shaohua Li <shaohua.li@intel.com>
Cc: Dave Hansen <dave@linux.vnet.ibm.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Rik van Riel <riel@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
											
										 
											2011-01-13 15:46:20 -08:00
										 |  |  | static unsigned long balance_pgdat(pg_data_t *pgdat, int order, | 
					
						
							| 
									
										
										
										
											2011-01-13 15:46:26 -08:00
										 |  |  | 							int *classzone_idx) | 
					
						
							| 
									
										
										
										
											2005-04-16 15:20:36 -07:00
										 |  |  | { | 
					
						
							|  |  |  | 	int i; | 
					
						
							| 
									
										
											  
											
												mm: kswapd: stop high-order balancing when any suitable zone is balanced
Simon Kirby reported the following problem
   We're seeing cases on a number of servers where cache never fully
   grows to use all available memory.  Sometimes we see servers with 4 GB
   of memory that never seem to have less than 1.5 GB free, even with a
   constantly-active VM.  In some cases, these servers also swap out while
   this happens, even though they are constantly reading the working set
   into memory.  We have been seeing this happening for a long time; I
   don't think it's anything recent, and it still happens on 2.6.36.
After some debugging work by Simon, Dave Hansen and others, the prevaling
theory became that kswapd is reclaiming order-3 pages requested by SLUB
too aggressive about it.
There are two apparent problems here.  On the target machine, there is a
small Normal zone in comparison to DMA32.  As kswapd tries to balance all
zones, it would continually try reclaiming for Normal even though DMA32
was balanced enough for callers.  The second problem is that
sleeping_prematurely() does not use the same logic as balance_pgdat() when
deciding whether to sleep or not.  This keeps kswapd artifically awake.
A number of tests were run and the figures from previous postings will
look very different for a few reasons.  One, the old figures were forcing
my network card to use GFP_ATOMIC in attempt to replicate Simon's problem.
 Second, I previous specified slub_min_order=3 again in an attempt to
reproduce Simon's problem.  In this posting, I'm depending on Simon to say
whether his problem is fixed or not and these figures are to show the
impact to the ordinary cases.  Finally, the "vmscan" figures are taken
from /proc/vmstat instead of the tracepoints.  There is less information
but recording is less disruptive.
The first test of relevance was postmark with a process running in the
background reading a large amount of anonymous memory in blocks.  The
objective was to vaguely simulate what was happening on Simon's machine
and it's memory intensive enough to have kswapd awake.
POSTMARK
                                            traceonly          kanyzone
Transactions per second:              156.00 ( 0.00%)   153.00 (-1.96%)
Data megabytes read per second:        21.51 ( 0.00%)    21.52 ( 0.05%)
Data megabytes written per second:     29.28 ( 0.00%)    29.11 (-0.58%)
Files created alone per second:       250.00 ( 0.00%)   416.00 (39.90%)
Files create/transact per second:      79.00 ( 0.00%)    76.00 (-3.95%)
Files deleted alone per second:       520.00 ( 0.00%)   420.00 (-23.81%)
Files delete/transact per second:      79.00 ( 0.00%)    76.00 (-3.95%)
MMTests Statistics: duration
User/Sys Time Running Test (seconds)         16.58      17.4
Total Elapsed Time (seconds)                218.48    222.47
VMstat Reclaim Statistics: vmscan
Direct reclaims                                  0          4
Direct reclaim pages scanned                     0        203
Direct reclaim pages reclaimed                   0        184
Kswapd pages scanned                        326631     322018
Kswapd pages reclaimed                      312632     309784
Kswapd low wmark quickly                         1          4
Kswapd high wmark quickly                      122        475
Kswapd skip congestion_wait                      1          0
Pages activated                             700040     705317
Pages deactivated                           212113     203922
Pages written                                 9875       6363
Total pages scanned                         326631    322221
Total pages reclaimed                       312632    309968
%age total pages scanned/reclaimed          95.71%    96.20%
%age total pages scanned/written             3.02%     1.97%
proc vmstat: Faults
Major Faults                                   300       254
Minor Faults                                645183    660284
Page ins                                    493588    486704
Page outs                                  4960088   4986704
Swap ins                                      1230       661
Swap outs                                     9869      6355
Performance is mildly affected because kswapd is no longer doing as much
work and the background memory consumer process is getting in the way.
Note that kswapd scanned and reclaimed fewer pages as it's less aggressive
and overall fewer pages were scanned and reclaimed.  Swap in/out is
particularly reduced again reflecting kswapd throwing out fewer pages.
The slight performance impact is unfortunate here but it looks like a
direct result of kswapd being less aggressive.  As the bug report is about
too many pages being freed by kswapd, it may have to be accepted for now.
The second test is a streaming IO benchmark that was previously used by
Johannes to show regressions in page reclaim.
MICRO
					 traceonly  kanyzone
User/Sys Time Running Test (seconds)         29.29     28.87
Total Elapsed Time (seconds)                492.18    488.79
VMstat Reclaim Statistics: vmscan
Direct reclaims                               2128       1460
Direct reclaim pages scanned               2284822    1496067
Direct reclaim pages reclaimed              148919     110937
Kswapd pages scanned                      15450014   16202876
Kswapd pages reclaimed                     8503697    8537897
Kswapd low wmark quickly                      3100       3397
Kswapd high wmark quickly                     1860       7243
Kswapd skip congestion_wait                    708        801
Pages activated                               9635       9573
Pages deactivated                             1432       1271
Pages written                                  223       1130
Total pages scanned                       17734836  17698943
Total pages reclaimed                      8652616   8648834
%age total pages scanned/reclaimed          48.79%    48.87%
%age total pages scanned/written             0.00%     0.01%
proc vmstat: Faults
Major Faults                                   165       221
Minor Faults                               9655785   9656506
Page ins                                      3880      7228
Page outs                                 37692940  37480076
Swap ins                                         0        69
Swap outs                                       19        15
Again fewer pages are scanned and reclaimed as expected and this time the
test completed faster.  Note that kswapd is hitting its watermarks faster
(low and high wmark quickly) which I expect is due to kswapd reclaiming
fewer pages.
I also ran fs-mark, iozone and sysbench but there is nothing interesting
to report in the figures.  Performance is not significantly changed and
the reclaim statistics look reasonable.
Tgis patch:
When the allocator enters its slow path, kswapd is woken up to balance the
node.  It continues working until all zones within the node are balanced.
For order-0 allocations, this makes perfect sense but for higher orders it
can have unintended side-effects.  If the zone sizes are imbalanced,
kswapd may reclaim heavily within a smaller zone discarding an excessive
number of pages.  The user-visible behaviour is that kswapd is awake and
reclaiming even though plenty of pages are free from a suitable zone.
This patch alters the "balance" logic for high-order reclaim allowing
kswapd to stop if any suitable zone becomes balanced to reduce the number
of pages it reclaims from other zones.  kswapd still tries to ensure that
order-0 watermarks for all zones are met before sleeping.
Signed-off-by: Mel Gorman <mel@csn.ul.ie>
Reviewed-by: Minchan Kim <minchan.kim@gmail.com>
Reviewed-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Reviewed-by: Eric B Munson <emunson@mgebm.net>
Cc: Simon Kirby <sim@hostway.ca>
Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Cc: Shaohua Li <shaohua.li@intel.com>
Cc: Dave Hansen <dave@linux.vnet.ibm.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Rik van Riel <riel@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
											
										 
											2011-01-13 15:46:20 -08:00
										 |  |  | 	int end_zone = 0;	/* Inclusive.  0 = ZONE_DMA */ | 
					
						
							| 
									
										
										
										
											2013-09-24 15:27:41 -07:00
										 |  |  | 	unsigned long nr_soft_reclaimed; | 
					
						
							|  |  |  | 	unsigned long nr_soft_scanned; | 
					
						
							| 
									
										
										
										
											2006-03-22 00:08:18 -08:00
										 |  |  | 	struct scan_control sc = { | 
					
						
							|  |  |  | 		.gfp_mask = GFP_KERNEL, | 
					
						
							| 
									
										
										
										
											2013-07-03 15:01:45 -07:00
										 |  |  | 		.priority = DEF_PRIORITY, | 
					
						
							| 
									
										
										
										
											2009-03-31 15:19:30 -07:00
										 |  |  | 		.may_unmap = 1, | 
					
						
							| 
									
										
										
										
											2009-04-21 12:24:57 -07:00
										 |  |  | 		.may_swap = 1, | 
					
						
							| 
									
										
										
										
											2013-07-03 15:01:45 -07:00
										 |  |  | 		.may_writepage = !laptop_mode, | 
					
						
							| 
									
										
										
										
											2007-07-17 04:03:16 -07:00
										 |  |  | 		.order = order, | 
					
						
							| 
									
										
										
										
											2012-01-12 17:17:52 -08:00
										 |  |  | 		.target_mem_cgroup = NULL, | 
					
						
							| 
									
										
										
										
											2006-03-22 00:08:18 -08:00
										 |  |  | 	}; | 
					
						
							| 
									
										
										
										
											2006-06-30 01:55:45 -07:00
										 |  |  | 	count_vm_event(PAGEOUTRUN); | 
					
						
							| 
									
										
										
										
											2005-04-16 15:20:36 -07:00
										 |  |  | 
 | 
					
						
							| 
									
										
										
										
											2012-05-29 15:06:57 -07:00
										 |  |  | 	do { | 
					
						
							| 
									
										
										
										
											2005-04-16 15:20:36 -07:00
										 |  |  | 		unsigned long lru_pages = 0; | 
					
						
							| 
									
										
										
										
											2013-07-03 15:01:47 -07:00
										 |  |  | 		unsigned long nr_attempted = 0; | 
					
						
							| 
									
										
										
										
											2013-07-03 15:01:45 -07:00
										 |  |  | 		bool raise_priority = true; | 
					
						
							| 
									
										
										
										
											2013-07-03 15:01:47 -07:00
										 |  |  | 		bool pgdat_needs_compaction = (order > 0); | 
					
						
							| 
									
										
										
										
											2013-07-03 15:01:45 -07:00
										 |  |  | 
 | 
					
						
							|  |  |  | 		sc.nr_reclaimed = 0; | 
					
						
							| 
									
										
										
										
											2005-04-16 15:20:36 -07:00
										 |  |  | 
 | 
					
						
							| 
									
										
										
										
											2006-06-23 02:03:18 -07:00
										 |  |  | 		/*
 | 
					
						
							|  |  |  | 		 * Scan in the highmem->dma direction for the highest | 
					
						
							|  |  |  | 		 * zone which needs scanning | 
					
						
							|  |  |  | 		 */ | 
					
						
							|  |  |  | 		for (i = pgdat->nr_zones - 1; i >= 0; i--) { | 
					
						
							|  |  |  | 			struct zone *zone = pgdat->node_zones + i; | 
					
						
							| 
									
										
										
										
											2005-04-16 15:20:36 -07:00
										 |  |  | 
 | 
					
						
							| 
									
										
										
										
											2006-06-23 02:03:18 -07:00
										 |  |  | 			if (!populated_zone(zone)) | 
					
						
							|  |  |  | 				continue; | 
					
						
							| 
									
										
										
										
											2005-04-16 15:20:36 -07:00
										 |  |  | 
 | 
					
						
							| 
									
										
										
										
											2013-09-11 14:22:36 -07:00
										 |  |  | 			if (sc.priority != DEF_PRIORITY && | 
					
						
							|  |  |  | 			    !zone_reclaimable(zone)) | 
					
						
							| 
									
										
										
										
											2006-06-23 02:03:18 -07:00
										 |  |  | 				continue; | 
					
						
							| 
									
										
										
										
											2005-04-16 15:20:36 -07:00
										 |  |  | 
 | 
					
						
							| 
									
										
										
										
											2008-10-18 20:26:34 -07:00
										 |  |  | 			/*
 | 
					
						
							|  |  |  | 			 * Do some background aging of the anon list, to give | 
					
						
							|  |  |  | 			 * pages a chance to be referenced before reclaiming. | 
					
						
							|  |  |  | 			 */ | 
					
						
							| 
									
										
										
										
											2012-05-29 15:06:57 -07:00
										 |  |  | 			age_active_anon(zone, &sc); | 
					
						
							| 
									
										
										
										
											2008-10-18 20:26:34 -07:00
										 |  |  | 
 | 
					
						
							| 
									
										
										
										
											2012-03-21 16:34:00 -07:00
										 |  |  | 			/*
 | 
					
						
							|  |  |  | 			 * If the number of buffer_heads in the machine | 
					
						
							|  |  |  | 			 * exceeds the maximum allowed level and this node | 
					
						
							|  |  |  | 			 * has a highmem zone, force kswapd to reclaim from | 
					
						
							|  |  |  | 			 * it to relieve lowmem pressure. | 
					
						
							|  |  |  | 			 */ | 
					
						
							|  |  |  | 			if (buffer_heads_over_limit && is_highmem_idx(i)) { | 
					
						
							|  |  |  | 				end_zone = i; | 
					
						
							|  |  |  | 				break; | 
					
						
							|  |  |  | 			} | 
					
						
							|  |  |  | 
 | 
					
						
							| 
									
										
										
										
											2012-11-29 13:54:23 -08:00
										 |  |  | 			if (!zone_balanced(zone, order, 0, 0)) { | 
					
						
							| 
									
										
										
										
											2006-06-23 02:03:18 -07:00
										 |  |  | 				end_zone = i; | 
					
						
							| 
									
										
										
										
											2006-12-06 20:32:01 -08:00
										 |  |  | 				break; | 
					
						
							| 
									
										
										
										
											2011-08-25 15:59:12 -07:00
										 |  |  | 			} else { | 
					
						
							| 
									
										
										
										
											2013-07-03 15:01:50 -07:00
										 |  |  | 				/*
 | 
					
						
							|  |  |  | 				 * If balanced, clear the dirty and congested | 
					
						
							|  |  |  | 				 * flags | 
					
						
							|  |  |  | 				 */ | 
					
						
							| 
									
										
										
										
											2011-08-25 15:59:12 -07:00
										 |  |  | 				zone_clear_flag(zone, ZONE_CONGESTED); | 
					
						
							| 
									
										
										
										
											2013-07-03 15:01:50 -07:00
										 |  |  | 				zone_clear_flag(zone, ZONE_TAIL_LRU_DIRTY); | 
					
						
							| 
									
										
										
										
											2005-04-16 15:20:36 -07:00
										 |  |  | 			} | 
					
						
							|  |  |  | 		} | 
					
						
							| 
									
										
										
										
											2013-02-22 16:32:34 -08:00
										 |  |  | 
 | 
					
						
							| 
									
										
										
										
											2013-07-03 15:01:45 -07:00
										 |  |  | 		if (i < 0) | 
					
						
							| 
									
										
										
										
											2006-12-06 20:32:01 -08:00
										 |  |  | 			goto out; | 
					
						
							|  |  |  | 
 | 
					
						
							| 
									
										
										
										
											2005-04-16 15:20:36 -07:00
										 |  |  | 		for (i = 0; i <= end_zone; i++) { | 
					
						
							|  |  |  | 			struct zone *zone = pgdat->node_zones + i; | 
					
						
							|  |  |  | 
 | 
					
						
							| 
									
										
										
										
											2013-07-03 15:01:47 -07:00
										 |  |  | 			if (!populated_zone(zone)) | 
					
						
							|  |  |  | 				continue; | 
					
						
							|  |  |  | 
 | 
					
						
							| 
									
										
										
										
											2009-09-21 17:01:42 -07:00
										 |  |  | 			lru_pages += zone_reclaimable_pages(zone); | 
					
						
							| 
									
										
										
										
											2013-07-03 15:01:47 -07:00
										 |  |  | 
 | 
					
						
							|  |  |  | 			/*
 | 
					
						
							|  |  |  | 			 * If any zone is currently balanced then kswapd will | 
					
						
							|  |  |  | 			 * not call compaction as it is expected that the | 
					
						
							|  |  |  | 			 * necessary pages are already available. | 
					
						
							|  |  |  | 			 */ | 
					
						
							|  |  |  | 			if (pgdat_needs_compaction && | 
					
						
							|  |  |  | 					zone_watermark_ok(zone, order, | 
					
						
							|  |  |  | 						low_wmark_pages(zone), | 
					
						
							|  |  |  | 						*classzone_idx, 0)) | 
					
						
							|  |  |  | 				pgdat_needs_compaction = false; | 
					
						
							| 
									
										
										
										
											2005-04-16 15:20:36 -07:00
										 |  |  | 		} | 
					
						
							|  |  |  | 
 | 
					
						
							| 
									
										
										
										
											2013-07-03 15:01:53 -07:00
										 |  |  | 		/*
 | 
					
						
							|  |  |  | 		 * If we're getting trouble reclaiming, start doing writepage | 
					
						
							|  |  |  | 		 * even in laptop mode. | 
					
						
							|  |  |  | 		 */ | 
					
						
							|  |  |  | 		if (sc.priority < DEF_PRIORITY - 2) | 
					
						
							|  |  |  | 			sc.may_writepage = 1; | 
					
						
							|  |  |  | 
 | 
					
						
							| 
									
										
										
										
											2005-04-16 15:20:36 -07:00
										 |  |  | 		/*
 | 
					
						
							|  |  |  | 		 * Now scan the zone in the dma->highmem direction, stopping | 
					
						
							|  |  |  | 		 * at the last zone which needs scanning. | 
					
						
							|  |  |  | 		 * | 
					
						
							|  |  |  | 		 * We do this because the page allocator works in the opposite | 
					
						
							|  |  |  | 		 * direction.  This prevents the page allocator from allocating | 
					
						
							|  |  |  | 		 * pages behind kswapd's direction of progress, which would | 
					
						
							|  |  |  | 		 * cause too much scanning of the lower zones. | 
					
						
							|  |  |  | 		 */ | 
					
						
							|  |  |  | 		for (i = 0; i <= end_zone; i++) { | 
					
						
							|  |  |  | 			struct zone *zone = pgdat->node_zones + i; | 
					
						
							|  |  |  | 
 | 
					
						
							| 
									
										
										
										
											2006-01-06 00:11:15 -08:00
										 |  |  | 			if (!populated_zone(zone)) | 
					
						
							| 
									
										
										
										
											2005-04-16 15:20:36 -07:00
										 |  |  | 				continue; | 
					
						
							|  |  |  | 
 | 
					
						
							| 
									
										
										
										
											2013-09-11 14:22:36 -07:00
										 |  |  | 			if (sc.priority != DEF_PRIORITY && | 
					
						
							|  |  |  | 			    !zone_reclaimable(zone)) | 
					
						
							| 
									
										
										
										
											2005-04-16 15:20:36 -07:00
										 |  |  | 				continue; | 
					
						
							|  |  |  | 
 | 
					
						
							|  |  |  | 			sc.nr_scanned = 0; | 
					
						
							| 
									
										
										
										
											2009-09-23 15:56:39 -07:00
										 |  |  | 
 | 
					
						
							| 
									
										
										
										
											2013-09-24 15:27:41 -07:00
										 |  |  | 			nr_soft_scanned = 0; | 
					
						
							|  |  |  | 			/*
 | 
					
						
							|  |  |  | 			 * Call soft limit reclaim before calling shrink_zone. | 
					
						
							|  |  |  | 			 */ | 
					
						
							|  |  |  | 			nr_soft_reclaimed = mem_cgroup_soft_limit_reclaim(zone, | 
					
						
							|  |  |  | 							order, sc.gfp_mask, | 
					
						
							|  |  |  | 							&nr_soft_scanned); | 
					
						
							|  |  |  | 			sc.nr_reclaimed += nr_soft_reclaimed; | 
					
						
							|  |  |  | 
 | 
					
						
							| 
									
										
											  
											
												mm: prevent kswapd from freeing excessive amounts of lowmem
The current VM can get itself into trouble fairly easily on systems with a
small ZONE_HIGHMEM, which is common on i686 computers with 1GB of memory.
On one side, page_alloc() will allocate down to zone->pages_low, while on
the other side, kswapd() and balance_pgdat() will try to free memory from
every zone, until every zone has more free pages than zone->pages_high.
Highmem can be filled up to zone->pages_low with page tables, ramfs,
vmalloc allocations and other unswappable things quite easily and without
many bad side effects, since we still have a huge ZONE_NORMAL to do future
allocations from.
However, as long as the number of free pages in the highmem zone is below
zone->pages_high, kswapd will continue swapping things out from
ZONE_NORMAL, too!
Sami Farin managed to get his system into a stage where kswapd had freed
about 700MB of low memory and was still "going strong".
The attached patch will make kswapd stop paging out data from zones when
there is more than enough memory free.  We do go above zone->pages_high in
order to keep pressure between zones equal in normal circumstances, but the
patch should prevent the kind of excesses that made Sami's computer totally
unusable.
Signed-off-by: Rik van Riel <riel@redhat.com>
Cc: Nick Piggin <nickpiggin@yahoo.com.au>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
											
										 
											2007-10-16 01:24:50 -07:00
										 |  |  | 			/*
 | 
					
						
							| 
									
										
										
										
											2013-07-03 15:01:54 -07:00
										 |  |  | 			 * There should be no need to raise the scanning | 
					
						
							|  |  |  | 			 * priority if enough pages are already being scanned | 
					
						
							|  |  |  | 			 * that that high watermark would be met at 100% | 
					
						
							|  |  |  | 			 * efficiency. | 
					
						
							| 
									
										
										
										
											2012-03-21 16:33:51 -07:00
										 |  |  | 			 */ | 
					
						
							| 
									
										
										
										
											2013-07-03 15:01:54 -07:00
										 |  |  | 			if (kswapd_shrink_zone(zone, end_zone, &sc, | 
					
						
							|  |  |  | 					lru_pages, &nr_attempted)) | 
					
						
							|  |  |  | 				raise_priority = false; | 
					
						
							| 
									
										
										
										
											2005-04-16 15:20:36 -07:00
										 |  |  | 		} | 
					
						
							| 
									
										
										
										
											2012-07-31 16:44:35 -07:00
										 |  |  | 
 | 
					
						
							|  |  |  | 		/*
 | 
					
						
							|  |  |  | 		 * If the low watermark is met there is no need for processes | 
					
						
							|  |  |  | 		 * to be throttled on pfmemalloc_wait as they should not be | 
					
						
							|  |  |  | 		 * able to safely make forward progress. Wake them | 
					
						
							|  |  |  | 		 */ | 
					
						
							|  |  |  | 		if (waitqueue_active(&pgdat->pfmemalloc_wait) && | 
					
						
							|  |  |  | 				pfmemalloc_watermark_ok(pgdat)) | 
					
						
							|  |  |  | 			wake_up(&pgdat->pfmemalloc_wait); | 
					
						
							|  |  |  | 
 | 
					
						
							| 
									
										
										
										
											2005-04-16 15:20:36 -07:00
										 |  |  | 		/*
 | 
					
						
							| 
									
										
										
										
											2013-07-03 15:01:45 -07:00
										 |  |  | 		 * Fragmentation may mean that the system cannot be rebalanced | 
					
						
							|  |  |  | 		 * for high-order allocations in all zones. If twice the | 
					
						
							|  |  |  | 		 * allocation size has been reclaimed and the zones are still | 
					
						
							|  |  |  | 		 * not balanced then recheck the watermarks at order-0 to | 
					
						
							|  |  |  | 		 * prevent kswapd reclaiming excessively. Assume that a | 
					
						
							|  |  |  | 		 * process requested a high-order can direct reclaim/compact. | 
					
						
							| 
									
										
										
										
											2005-04-16 15:20:36 -07:00
										 |  |  | 		 */ | 
					
						
							| 
									
										
										
										
											2013-07-03 15:01:45 -07:00
										 |  |  | 		if (order && sc.nr_reclaimed >= 2UL << order) | 
					
						
							|  |  |  | 			order = sc.order = 0; | 
					
						
							| 
									
										
											  
											
												[PATCH] swsusp: Improve handling of highmem
Currently swsusp saves the contents of highmem pages by copying them to the
normal zone which is quite inefficient (eg.  it requires two normal pages
to be used for saving one highmem page).  This may be improved by using
highmem for saving the contents of saveable highmem pages.
Namely, during the suspend phase of the suspend-resume cycle we try to
allocate as many free highmem pages as there are saveable highmem pages.
If there are not enough highmem image pages to store the contents of all of
the saveable highmem pages, some of them will be stored in the "normal"
memory.  Next, we allocate as many free "normal" pages as needed to store
the (remaining) image data.  We use a memory bitmap to mark the allocated
free pages (ie.  highmem as well as "normal" image pages).
Now, we use another memory bitmap to mark all of the saveable pages
(highmem as well as "normal") and the contents of the saveable pages are
copied into the image pages.  Then, the second bitmap is used to save the
pfns corresponding to the saveable pages and the first one is used to save
their data.
During the resume phase the pfns of the pages that were saveable during the
suspend are loaded from the image and used to mark the "unsafe" page
frames.  Next, we try to allocate as many free highmem page frames as to
load all of the image data that had been in the highmem before the suspend
and we allocate so many free "normal" page frames that the total number of
allocated free pages (highmem and "normal") is equal to the size of the
image.  While doing this we have to make sure that there will be some extra
free "normal" and "safe" page frames for two lists of PBEs constructed
later.
Now, the image data are loaded, if possible, into their "original" page
frames.  The image data that cannot be written into their "original" page
frames are loaded into "safe" page frames and their "original" kernel
virtual addresses, as well as the addresses of the "safe" pages containing
their copies, are stored in one of two lists of PBEs.
One list of PBEs is for the copies of "normal" suspend pages (ie.  "normal"
pages that were saveable during the suspend) and it is used in the same way
as previously (ie.  by the architecture-dependent parts of swsusp).  The
other list of PBEs is for the copies of highmem suspend pages.  The pages
in this list are restored (in a reversible way) right before the
arch-dependent code is called.
Signed-off-by: Rafael J. Wysocki <rjw@sisk.pl>
Cc: Pavel Machek <pavel@ucw.cz>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
											
										 
											2006-12-06 20:34:18 -08:00
										 |  |  | 
 | 
					
						
							| 
									
										
										
										
											2013-07-03 15:01:45 -07:00
										 |  |  | 		/* Check if kswapd should be suspending */ | 
					
						
							|  |  |  | 		if (try_to_freeze() || kthread_should_stop()) | 
					
						
							|  |  |  | 			break; | 
					
						
							| 
									
										
											  
											
												[PATCH] swsusp: Improve handling of highmem
Currently swsusp saves the contents of highmem pages by copying them to the
normal zone which is quite inefficient (eg.  it requires two normal pages
to be used for saving one highmem page).  This may be improved by using
highmem for saving the contents of saveable highmem pages.
Namely, during the suspend phase of the suspend-resume cycle we try to
allocate as many free highmem pages as there are saveable highmem pages.
If there are not enough highmem image pages to store the contents of all of
the saveable highmem pages, some of them will be stored in the "normal"
memory.  Next, we allocate as many free "normal" pages as needed to store
the (remaining) image data.  We use a memory bitmap to mark the allocated
free pages (ie.  highmem as well as "normal" image pages).
Now, we use another memory bitmap to mark all of the saveable pages
(highmem as well as "normal") and the contents of the saveable pages are
copied into the image pages.  Then, the second bitmap is used to save the
pfns corresponding to the saveable pages and the first one is used to save
their data.
During the resume phase the pfns of the pages that were saveable during the
suspend are loaded from the image and used to mark the "unsafe" page
frames.  Next, we try to allocate as many free highmem page frames as to
load all of the image data that had been in the highmem before the suspend
and we allocate so many free "normal" page frames that the total number of
allocated free pages (highmem and "normal") is equal to the size of the
image.  While doing this we have to make sure that there will be some extra
free "normal" and "safe" page frames for two lists of PBEs constructed
later.
Now, the image data are loaded, if possible, into their "original" page
frames.  The image data that cannot be written into their "original" page
frames are loaded into "safe" page frames and their "original" kernel
virtual addresses, as well as the addresses of the "safe" pages containing
their copies, are stored in one of two lists of PBEs.
One list of PBEs is for the copies of "normal" suspend pages (ie.  "normal"
pages that were saveable during the suspend) and it is used in the same way
as previously (ie.  by the architecture-dependent parts of swsusp).  The
other list of PBEs is for the copies of highmem suspend pages.  The pages
in this list are restored (in a reversible way) right before the
arch-dependent code is called.
Signed-off-by: Rafael J. Wysocki <rjw@sisk.pl>
Cc: Pavel Machek <pavel@ucw.cz>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
											
										 
											2006-12-06 20:34:18 -08:00
										 |  |  | 
 | 
					
						
							| 
									
										
										
										
											2013-07-03 15:01:47 -07:00
										 |  |  | 		/*
 | 
					
						
							|  |  |  | 		 * Compact if necessary and kswapd is reclaiming at least the | 
					
						
							|  |  |  | 		 * high watermark number of pages as requsted | 
					
						
							|  |  |  | 		 */ | 
					
						
							|  |  |  | 		if (pgdat_needs_compaction && sc.nr_reclaimed > nr_attempted) | 
					
						
							|  |  |  | 			compact_pgdat(pgdat, order); | 
					
						
							|  |  |  | 
 | 
					
						
							| 
									
										
										
										
											2009-01-06 14:40:33 -08:00
										 |  |  | 		/*
 | 
					
						
							| 
									
										
										
										
											2013-07-03 15:01:45 -07:00
										 |  |  | 		 * Raise priority if scanning rate is too low or there was no | 
					
						
							|  |  |  | 		 * progress in reclaiming pages | 
					
						
							| 
									
										
										
										
											2009-01-06 14:40:33 -08:00
										 |  |  | 		 */ | 
					
						
							| 
									
										
										
										
											2013-07-03 15:01:45 -07:00
										 |  |  | 		if (raise_priority || !sc.nr_reclaimed) | 
					
						
							|  |  |  | 			sc.priority--; | 
					
						
							| 
									
										
										
										
											2013-07-03 15:01:48 -07:00
										 |  |  | 	} while (sc.priority >= 1 && | 
					
						
							| 
									
										
										
										
											2013-07-03 15:01:45 -07:00
										 |  |  | 		 !pgdat_balanced(pgdat, order, *classzone_idx)); | 
					
						
							| 
									
										
										
										
											2005-04-16 15:20:36 -07:00
										 |  |  | 
 | 
					
						
							| 
									
										
										
										
											2013-07-03 15:01:45 -07:00
										 |  |  | out: | 
					
						
							| 
									
										
										
										
											2011-01-13 15:46:22 -08:00
										 |  |  | 	/*
 | 
					
						
							| 
									
										
										
										
											2012-07-31 16:44:35 -07:00
										 |  |  | 	 * Return the order we were reclaiming at so prepare_kswapd_sleep() | 
					
						
							| 
									
										
										
										
											2011-01-13 15:46:22 -08:00
										 |  |  | 	 * makes a decision on the order we were last reclaiming at. However, | 
					
						
							|  |  |  | 	 * if another caller entered the allocator slow path while kswapd | 
					
						
							|  |  |  | 	 * was awake, order will remain at the higher level | 
					
						
							|  |  |  | 	 */ | 
					
						
							| 
									
										
										
										
											2011-01-13 15:46:26 -08:00
										 |  |  | 	*classzone_idx = end_zone; | 
					
						
							| 
									
										
										
										
											2011-01-13 15:46:22 -08:00
										 |  |  | 	return order; | 
					
						
							| 
									
										
										
										
											2005-04-16 15:20:36 -07:00
										 |  |  | } | 
					
						
							|  |  |  | 
 | 
					
						
							| 
									
										
										
										
											2011-01-13 15:46:26 -08:00
										 |  |  | static void kswapd_try_to_sleep(pg_data_t *pgdat, int order, int classzone_idx) | 
					
						
							| 
									
										
										
										
											2011-01-13 15:45:50 -08:00
										 |  |  | { | 
					
						
							|  |  |  | 	long remaining = 0; | 
					
						
							|  |  |  | 	DEFINE_WAIT(wait); | 
					
						
							|  |  |  | 
 | 
					
						
							|  |  |  | 	if (freezing(current) || kthread_should_stop()) | 
					
						
							|  |  |  | 		return; | 
					
						
							|  |  |  | 
 | 
					
						
							|  |  |  | 	prepare_to_wait(&pgdat->kswapd_wait, &wait, TASK_INTERRUPTIBLE); | 
					
						
							|  |  |  | 
 | 
					
						
							|  |  |  | 	/* Try to sleep for a short interval */ | 
					
						
							| 
									
										
										
										
											2012-07-31 16:44:35 -07:00
										 |  |  | 	if (prepare_kswapd_sleep(pgdat, order, remaining, classzone_idx)) { | 
					
						
							| 
									
										
										
										
											2011-01-13 15:45:50 -08:00
										 |  |  | 		remaining = schedule_timeout(HZ/10); | 
					
						
							|  |  |  | 		finish_wait(&pgdat->kswapd_wait, &wait); | 
					
						
							|  |  |  | 		prepare_to_wait(&pgdat->kswapd_wait, &wait, TASK_INTERRUPTIBLE); | 
					
						
							|  |  |  | 	} | 
					
						
							|  |  |  | 
 | 
					
						
							|  |  |  | 	/*
 | 
					
						
							|  |  |  | 	 * After a short sleep, check if it was a premature sleep. If not, then | 
					
						
							|  |  |  | 	 * go fully to sleep until explicitly woken up. | 
					
						
							|  |  |  | 	 */ | 
					
						
							| 
									
										
										
										
											2012-07-31 16:44:35 -07:00
										 |  |  | 	if (prepare_kswapd_sleep(pgdat, order, remaining, classzone_idx)) { | 
					
						
							| 
									
										
										
										
											2011-01-13 15:45:50 -08:00
										 |  |  | 		trace_mm_vmscan_kswapd_sleep(pgdat->node_id); | 
					
						
							|  |  |  | 
 | 
					
						
							|  |  |  | 		/*
 | 
					
						
							|  |  |  | 		 * vmstat counters are not perfectly accurate and the estimated | 
					
						
							|  |  |  | 		 * value for counters such as NR_FREE_PAGES can deviate from the | 
					
						
							|  |  |  | 		 * true value by nr_online_cpus * threshold. To avoid the zone | 
					
						
							|  |  |  | 		 * watermarks being breached while under pressure, we reduce the | 
					
						
							|  |  |  | 		 * per-cpu vmstat threshold while kswapd is awake and restore | 
					
						
							|  |  |  | 		 * them before going back to sleep. | 
					
						
							|  |  |  | 		 */ | 
					
						
							|  |  |  | 		set_pgdat_percpu_threshold(pgdat, calculate_normal_threshold); | 
					
						
							| 
									
										
										
										
											2012-07-17 15:48:07 -07:00
										 |  |  | 
 | 
					
						
							| 
									
										
										
										
											2012-10-08 16:32:47 -07:00
										 |  |  | 		/*
 | 
					
						
							|  |  |  | 		 * Compaction records what page blocks it recently failed to | 
					
						
							|  |  |  | 		 * isolate pages from and skips them in the future scanning. | 
					
						
							|  |  |  | 		 * When kswapd is going to sleep, it is reasonable to assume | 
					
						
							|  |  |  | 		 * that pages and compaction may succeed so reset the cache. | 
					
						
							|  |  |  | 		 */ | 
					
						
							|  |  |  | 		reset_isolation_suitable(pgdat); | 
					
						
							|  |  |  | 
 | 
					
						
							| 
									
										
										
										
											2012-07-17 15:48:07 -07:00
										 |  |  | 		if (!kthread_should_stop()) | 
					
						
							|  |  |  | 			schedule(); | 
					
						
							|  |  |  | 
 | 
					
						
							| 
									
										
										
										
											2011-01-13 15:45:50 -08:00
										 |  |  | 		set_pgdat_percpu_threshold(pgdat, calculate_pressure_threshold); | 
					
						
							|  |  |  | 	} else { | 
					
						
							|  |  |  | 		if (remaining) | 
					
						
							|  |  |  | 			count_vm_event(KSWAPD_LOW_WMARK_HIT_QUICKLY); | 
					
						
							|  |  |  | 		else | 
					
						
							|  |  |  | 			count_vm_event(KSWAPD_HIGH_WMARK_HIT_QUICKLY); | 
					
						
							|  |  |  | 	} | 
					
						
							|  |  |  | 	finish_wait(&pgdat->kswapd_wait, &wait); | 
					
						
							|  |  |  | } | 
					
						
							|  |  |  | 
 | 
					
						
							| 
									
										
										
										
											2005-04-16 15:20:36 -07:00
										 |  |  | /*
 | 
					
						
							|  |  |  |  * The background pageout daemon, started as a kernel thread | 
					
						
							| 
									
										
										
										
											2008-10-18 20:26:32 -07:00
										 |  |  |  * from the init process. | 
					
						
							| 
									
										
										
										
											2005-04-16 15:20:36 -07:00
										 |  |  |  * | 
					
						
							|  |  |  |  * This basically trickles out pages so that we have _some_ | 
					
						
							|  |  |  |  * free memory available even if there is no other activity | 
					
						
							|  |  |  |  * that frees anything up. This is needed for things like routing | 
					
						
							|  |  |  |  * etc, where we otherwise might have all activity going on in | 
					
						
							|  |  |  |  * asynchronous contexts that cannot page things out. | 
					
						
							|  |  |  |  * | 
					
						
							|  |  |  |  * If there are applications that are active memory-allocators | 
					
						
							|  |  |  |  * (most normal use), this basically shouldn't matter. | 
					
						
							|  |  |  |  */ | 
					
						
							|  |  |  | static int kswapd(void *p) | 
					
						
							|  |  |  | { | 
					
						
							| 
									
										
											  
											
												mm: vmscan: only read new_classzone_idx from pgdat when reclaiming successfully
During allocator-intensive workloads, kswapd will be woken frequently
causing free memory to oscillate between the high and min watermark.  This
is expected behaviour.  Unfortunately, if the highest zone is small, a
problem occurs.
When balance_pgdat() returns, it may be at a lower classzone_idx than it
started because the highest zone was unreclaimable.  Before checking if it
should go to sleep though, it checks pgdat->classzone_idx which when there
is no other activity will be MAX_NR_ZONES-1.  It interprets this as it has
been woken up while reclaiming, skips scheduling and reclaims again.  As
there is no useful reclaim work to do, it enters into a loop of shrinking
slab consuming loads of CPU until the highest zone becomes reclaimable for
a long period of time.
There are two problems here.  1) If the returned classzone or order is
lower, it'll continue reclaiming without scheduling.  2) if the highest
zone was marked unreclaimable but balance_pgdat() returns immediately at
DEF_PRIORITY, the new lower classzone is not communicated back to kswapd()
for sleeping.
This patch does two things that are related.  If the end_zone is
unreclaimable, this information is communicated back.  Second, if the
classzone or order was reduced due to failing to reclaim, new information
is not read from pgdat and instead an attempt is made to go to sleep.  Due
to this, it is also necessary that pgdat->classzone_idx be initialised
each time to pgdat->nr_zones - 1 to avoid re-reads being interpreted as
wakeups.
Signed-off-by: Mel Gorman <mgorman@suse.de>
Reported-by: Pádraig Brady <P@draigBrady.com>
Tested-by: Pádraig Brady <P@draigBrady.com>
Tested-by: Andrew Lutomirski <luto@mit.edu>
Acked-by: Rik van Riel <riel@redhat.com>
Cc: Minchan Kim <minchan.kim@gmail.com>
Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: <stable@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
											
										 
											2011-07-08 15:39:40 -07:00
										 |  |  | 	unsigned long order, new_order; | 
					
						
							| 
									
										
											  
											
												kswapd: avoid unnecessary rebalance after an unsuccessful balancing
In commit 215ddd66 ("mm: vmscan: only read new_classzone_idx from pgdat
when reclaiming successfully") , Mel Gorman said kswapd is better to sleep
after a unsuccessful balancing if there is tighter reclaim request pending
in the balancing.  But in the following scenario, kswapd do something that
is not matched our expectation.  The patch fixes this issue.
1, Read pgdat request A (classzone_idx, order = 3)
2, balance_pgdat()
3, During pgdat, a new pgdat request B (classzone_idx, order = 5) is placed
4, balance_pgdat() returns but failed since returned order = 0
5, pgdat of request A assigned to balance_pgdat(), and do balancing again.
   While the expectation behavior of kswapd should try to sleep.
Signed-off-by: Alex Shi <alex.shi@intel.com>
Reviewed-by: Tim Chen <tim.c.chen@linux.intel.com>
Acked-by: Mel Gorman <mgorman@suse.de>
Tested-by: Pádraig Brady <P@draigBrady.com>
Cc: Rik van Riel <riel@redhat.com>
Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
											
										 
											2011-10-31 17:08:39 -07:00
										 |  |  | 	unsigned balanced_order; | 
					
						
							| 
									
										
											  
											
												mm: vmscan: only read new_classzone_idx from pgdat when reclaiming successfully
During allocator-intensive workloads, kswapd will be woken frequently
causing free memory to oscillate between the high and min watermark.  This
is expected behaviour.  Unfortunately, if the highest zone is small, a
problem occurs.
When balance_pgdat() returns, it may be at a lower classzone_idx than it
started because the highest zone was unreclaimable.  Before checking if it
should go to sleep though, it checks pgdat->classzone_idx which when there
is no other activity will be MAX_NR_ZONES-1.  It interprets this as it has
been woken up while reclaiming, skips scheduling and reclaims again.  As
there is no useful reclaim work to do, it enters into a loop of shrinking
slab consuming loads of CPU until the highest zone becomes reclaimable for
a long period of time.
There are two problems here.  1) If the returned classzone or order is
lower, it'll continue reclaiming without scheduling.  2) if the highest
zone was marked unreclaimable but balance_pgdat() returns immediately at
DEF_PRIORITY, the new lower classzone is not communicated back to kswapd()
for sleeping.
This patch does two things that are related.  If the end_zone is
unreclaimable, this information is communicated back.  Second, if the
classzone or order was reduced due to failing to reclaim, new information
is not read from pgdat and instead an attempt is made to go to sleep.  Due
to this, it is also necessary that pgdat->classzone_idx be initialised
each time to pgdat->nr_zones - 1 to avoid re-reads being interpreted as
wakeups.
Signed-off-by: Mel Gorman <mgorman@suse.de>
Reported-by: Pádraig Brady <P@draigBrady.com>
Tested-by: Pádraig Brady <P@draigBrady.com>
Tested-by: Andrew Lutomirski <luto@mit.edu>
Acked-by: Rik van Riel <riel@redhat.com>
Cc: Minchan Kim <minchan.kim@gmail.com>
Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: <stable@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
											
										 
											2011-07-08 15:39:40 -07:00
										 |  |  | 	int classzone_idx, new_classzone_idx; | 
					
						
							| 
									
										
											  
											
												kswapd: avoid unnecessary rebalance after an unsuccessful balancing
In commit 215ddd66 ("mm: vmscan: only read new_classzone_idx from pgdat
when reclaiming successfully") , Mel Gorman said kswapd is better to sleep
after a unsuccessful balancing if there is tighter reclaim request pending
in the balancing.  But in the following scenario, kswapd do something that
is not matched our expectation.  The patch fixes this issue.
1, Read pgdat request A (classzone_idx, order = 3)
2, balance_pgdat()
3, During pgdat, a new pgdat request B (classzone_idx, order = 5) is placed
4, balance_pgdat() returns but failed since returned order = 0
5, pgdat of request A assigned to balance_pgdat(), and do balancing again.
   While the expectation behavior of kswapd should try to sleep.
Signed-off-by: Alex Shi <alex.shi@intel.com>
Reviewed-by: Tim Chen <tim.c.chen@linux.intel.com>
Acked-by: Mel Gorman <mgorman@suse.de>
Tested-by: Pádraig Brady <P@draigBrady.com>
Cc: Rik van Riel <riel@redhat.com>
Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
											
										 
											2011-10-31 17:08:39 -07:00
										 |  |  | 	int balanced_classzone_idx; | 
					
						
							| 
									
										
										
										
											2005-04-16 15:20:36 -07:00
										 |  |  | 	pg_data_t *pgdat = (pg_data_t*)p; | 
					
						
							|  |  |  | 	struct task_struct *tsk = current; | 
					
						
							| 
									
										
										
										
											2011-01-13 15:45:50 -08:00
										 |  |  | 
 | 
					
						
							| 
									
										
										
										
											2005-04-16 15:20:36 -07:00
										 |  |  | 	struct reclaim_state reclaim_state = { | 
					
						
							|  |  |  | 		.reclaimed_slab = 0, | 
					
						
							|  |  |  | 	}; | 
					
						
							| 
									
										
										
										
											2009-03-13 14:49:46 +10:30
										 |  |  | 	const struct cpumask *cpumask = cpumask_of_node(pgdat->node_id); | 
					
						
							| 
									
										
										
										
											2005-04-16 15:20:36 -07:00
										 |  |  | 
 | 
					
						
							| 
									
										
											  
											
												lockdep: annotate reclaim context (__GFP_NOFS)
Here is another version, with the incremental patch rolled up, and
added reclaim context annotation to kswapd, and allocation tracing
to slab allocators (which may only ever reach the page allocator
in rare cases, so it is good to put annotations here too).
Haven't tested this version as such, but it should be getting closer
to merge worthy ;)
--
After noticing some code in mm/filemap.c accidentally perform a __GFP_FS
allocation when it should not have been, I thought it might be a good idea to
try to catch this kind of thing with lockdep.
I coded up a little idea that seems to work. Unfortunately the system has to
actually be in __GFP_FS page reclaim, then take the lock, before it will mark
it. But at least that might still be some orders of magnitude more common
(and more debuggable) than an actual deadlock condition, so we have some
improvement I hope (the concept is no less complete than discovery of a lock's
interrupt contexts).
I guess we could even do the same thing with __GFP_IO (normal reclaim), and
even GFP_NOIO locks too... but filesystems will have the most locks and fiddly
code paths, so let's start there and see how it goes.
It *seems* to work. I did a quick test.
=================================
[ INFO: inconsistent lock state ]
2.6.28-rc6-00007-ged31348-dirty #26
---------------------------------
inconsistent {in-reclaim-W} -> {ov-reclaim-W} usage.
modprobe/8526 [HC0[0]:SC0[0]:HE1:SE1] takes:
 (testlock){--..}, at: [<ffffffffa0020055>] brd_init+0x55/0x216 [brd]
{in-reclaim-W} state was registered at:
  [<ffffffff80267bdb>] __lock_acquire+0x75b/0x1a60
  [<ffffffff80268f71>] lock_acquire+0x91/0xc0
  [<ffffffff8070f0e1>] mutex_lock_nested+0xb1/0x310
  [<ffffffffa002002b>] brd_init+0x2b/0x216 [brd]
  [<ffffffff8020903b>] _stext+0x3b/0x170
  [<ffffffff80272ebf>] sys_init_module+0xaf/0x1e0
  [<ffffffff8020c3fb>] system_call_fastpath+0x16/0x1b
  [<ffffffffffffffff>] 0xffffffffffffffff
irq event stamp: 3929
hardirqs last  enabled at (3929): [<ffffffff8070f2b5>] mutex_lock_nested+0x285/0x310
hardirqs last disabled at (3928): [<ffffffff8070f089>] mutex_lock_nested+0x59/0x310
softirqs last  enabled at (3732): [<ffffffff8061f623>] sk_filter+0x83/0xe0
softirqs last disabled at (3730): [<ffffffff8061f5b6>] sk_filter+0x16/0xe0
other info that might help us debug this:
1 lock held by modprobe/8526:
 #0:  (testlock){--..}, at: [<ffffffffa0020055>] brd_init+0x55/0x216 [brd]
stack backtrace:
Pid: 8526, comm: modprobe Not tainted 2.6.28-rc6-00007-ged31348-dirty #26
Call Trace:
 [<ffffffff80265483>] print_usage_bug+0x193/0x1d0
 [<ffffffff80266530>] mark_lock+0xaf0/0xca0
 [<ffffffff80266735>] mark_held_locks+0x55/0xc0
 [<ffffffffa0020000>] ? brd_init+0x0/0x216 [brd]
 [<ffffffff802667ca>] trace_reclaim_fs+0x2a/0x60
 [<ffffffff80285005>] __alloc_pages_internal+0x475/0x580
 [<ffffffff8070f29e>] ? mutex_lock_nested+0x26e/0x310
 [<ffffffffa0020000>] ? brd_init+0x0/0x216 [brd]
 [<ffffffffa002006a>] brd_init+0x6a/0x216 [brd]
 [<ffffffffa0020000>] ? brd_init+0x0/0x216 [brd]
 [<ffffffff8020903b>] _stext+0x3b/0x170
 [<ffffffff8070f8b9>] ? mutex_unlock+0x9/0x10
 [<ffffffff8070f83d>] ? __mutex_unlock_slowpath+0x10d/0x180
 [<ffffffff802669ec>] ? trace_hardirqs_on_caller+0x12c/0x190
 [<ffffffff80272ebf>] sys_init_module+0xaf/0x1e0
 [<ffffffff8020c3fb>] system_call_fastpath+0x16/0x1b
Signed-off-by: Nick Piggin <npiggin@suse.de>
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
											
										 
											2009-01-21 08:12:39 +01:00
										 |  |  | 	lockdep_set_current_reclaim_state(GFP_KERNEL); | 
					
						
							|  |  |  | 
 | 
					
						
							| 
									
										
										
										
											2009-01-01 10:12:29 +10:30
										 |  |  | 	if (!cpumask_empty(cpumask)) | 
					
						
							| 
									
										
										
										
											2008-04-04 18:11:10 -07:00
										 |  |  | 		set_cpus_allowed_ptr(tsk, cpumask); | 
					
						
							| 
									
										
										
										
											2005-04-16 15:20:36 -07:00
										 |  |  | 	current->reclaim_state = &reclaim_state; | 
					
						
							|  |  |  | 
 | 
					
						
							|  |  |  | 	/*
 | 
					
						
							|  |  |  | 	 * Tell the memory management that we're a "memory allocator", | 
					
						
							|  |  |  | 	 * and that if we need more memory we should get access to it | 
					
						
							|  |  |  | 	 * regardless (see "__alloc_pages()"). "kswapd" should | 
					
						
							|  |  |  | 	 * never get caught in the normal page freeing logic. | 
					
						
							|  |  |  | 	 * | 
					
						
							|  |  |  | 	 * (Kswapd normally doesn't need memory anyway, but sometimes | 
					
						
							|  |  |  | 	 * you need a small amount of memory in order to be able to | 
					
						
							|  |  |  | 	 * page out something else, and this flag essentially protects | 
					
						
							|  |  |  | 	 * us from recursively trying to free more memory as we're | 
					
						
							|  |  |  | 	 * trying to free the first piece of memory in the first place). | 
					
						
							|  |  |  | 	 */ | 
					
						
							| 
									
										
										
										
											2006-01-08 01:00:47 -08:00
										 |  |  | 	tsk->flags |= PF_MEMALLOC | PF_SWAPWRITE | PF_KSWAPD; | 
					
						
							| 
									
										
										
										
											2007-07-17 04:03:35 -07:00
										 |  |  | 	set_freezable(); | 
					
						
							| 
									
										
										
										
											2005-04-16 15:20:36 -07:00
										 |  |  | 
 | 
					
						
							| 
									
										
											  
											
												mm: vmscan: only read new_classzone_idx from pgdat when reclaiming successfully
During allocator-intensive workloads, kswapd will be woken frequently
causing free memory to oscillate between the high and min watermark.  This
is expected behaviour.  Unfortunately, if the highest zone is small, a
problem occurs.
When balance_pgdat() returns, it may be at a lower classzone_idx than it
started because the highest zone was unreclaimable.  Before checking if it
should go to sleep though, it checks pgdat->classzone_idx which when there
is no other activity will be MAX_NR_ZONES-1.  It interprets this as it has
been woken up while reclaiming, skips scheduling and reclaims again.  As
there is no useful reclaim work to do, it enters into a loop of shrinking
slab consuming loads of CPU until the highest zone becomes reclaimable for
a long period of time.
There are two problems here.  1) If the returned classzone or order is
lower, it'll continue reclaiming without scheduling.  2) if the highest
zone was marked unreclaimable but balance_pgdat() returns immediately at
DEF_PRIORITY, the new lower classzone is not communicated back to kswapd()
for sleeping.
This patch does two things that are related.  If the end_zone is
unreclaimable, this information is communicated back.  Second, if the
classzone or order was reduced due to failing to reclaim, new information
is not read from pgdat and instead an attempt is made to go to sleep.  Due
to this, it is also necessary that pgdat->classzone_idx be initialised
each time to pgdat->nr_zones - 1 to avoid re-reads being interpreted as
wakeups.
Signed-off-by: Mel Gorman <mgorman@suse.de>
Reported-by: Pádraig Brady <P@draigBrady.com>
Tested-by: Pádraig Brady <P@draigBrady.com>
Tested-by: Andrew Lutomirski <luto@mit.edu>
Acked-by: Rik van Riel <riel@redhat.com>
Cc: Minchan Kim <minchan.kim@gmail.com>
Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: <stable@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
											
										 
											2011-07-08 15:39:40 -07:00
										 |  |  | 	order = new_order = 0; | 
					
						
							| 
									
										
											  
											
												kswapd: avoid unnecessary rebalance after an unsuccessful balancing
In commit 215ddd66 ("mm: vmscan: only read new_classzone_idx from pgdat
when reclaiming successfully") , Mel Gorman said kswapd is better to sleep
after a unsuccessful balancing if there is tighter reclaim request pending
in the balancing.  But in the following scenario, kswapd do something that
is not matched our expectation.  The patch fixes this issue.
1, Read pgdat request A (classzone_idx, order = 3)
2, balance_pgdat()
3, During pgdat, a new pgdat request B (classzone_idx, order = 5) is placed
4, balance_pgdat() returns but failed since returned order = 0
5, pgdat of request A assigned to balance_pgdat(), and do balancing again.
   While the expectation behavior of kswapd should try to sleep.
Signed-off-by: Alex Shi <alex.shi@intel.com>
Reviewed-by: Tim Chen <tim.c.chen@linux.intel.com>
Acked-by: Mel Gorman <mgorman@suse.de>
Tested-by: Pádraig Brady <P@draigBrady.com>
Cc: Rik van Riel <riel@redhat.com>
Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
											
										 
											2011-10-31 17:08:39 -07:00
										 |  |  | 	balanced_order = 0; | 
					
						
							| 
									
										
											  
											
												mm: vmscan: only read new_classzone_idx from pgdat when reclaiming successfully
During allocator-intensive workloads, kswapd will be woken frequently
causing free memory to oscillate between the high and min watermark.  This
is expected behaviour.  Unfortunately, if the highest zone is small, a
problem occurs.
When balance_pgdat() returns, it may be at a lower classzone_idx than it
started because the highest zone was unreclaimable.  Before checking if it
should go to sleep though, it checks pgdat->classzone_idx which when there
is no other activity will be MAX_NR_ZONES-1.  It interprets this as it has
been woken up while reclaiming, skips scheduling and reclaims again.  As
there is no useful reclaim work to do, it enters into a loop of shrinking
slab consuming loads of CPU until the highest zone becomes reclaimable for
a long period of time.
There are two problems here.  1) If the returned classzone or order is
lower, it'll continue reclaiming without scheduling.  2) if the highest
zone was marked unreclaimable but balance_pgdat() returns immediately at
DEF_PRIORITY, the new lower classzone is not communicated back to kswapd()
for sleeping.
This patch does two things that are related.  If the end_zone is
unreclaimable, this information is communicated back.  Second, if the
classzone or order was reduced due to failing to reclaim, new information
is not read from pgdat and instead an attempt is made to go to sleep.  Due
to this, it is also necessary that pgdat->classzone_idx be initialised
each time to pgdat->nr_zones - 1 to avoid re-reads being interpreted as
wakeups.
Signed-off-by: Mel Gorman <mgorman@suse.de>
Reported-by: Pádraig Brady <P@draigBrady.com>
Tested-by: Pádraig Brady <P@draigBrady.com>
Tested-by: Andrew Lutomirski <luto@mit.edu>
Acked-by: Rik van Riel <riel@redhat.com>
Cc: Minchan Kim <minchan.kim@gmail.com>
Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: <stable@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
											
										 
											2011-07-08 15:39:40 -07:00
										 |  |  | 	classzone_idx = new_classzone_idx = pgdat->nr_zones - 1; | 
					
						
							| 
									
										
											  
											
												kswapd: avoid unnecessary rebalance after an unsuccessful balancing
In commit 215ddd66 ("mm: vmscan: only read new_classzone_idx from pgdat
when reclaiming successfully") , Mel Gorman said kswapd is better to sleep
after a unsuccessful balancing if there is tighter reclaim request pending
in the balancing.  But in the following scenario, kswapd do something that
is not matched our expectation.  The patch fixes this issue.
1, Read pgdat request A (classzone_idx, order = 3)
2, balance_pgdat()
3, During pgdat, a new pgdat request B (classzone_idx, order = 5) is placed
4, balance_pgdat() returns but failed since returned order = 0
5, pgdat of request A assigned to balance_pgdat(), and do balancing again.
   While the expectation behavior of kswapd should try to sleep.
Signed-off-by: Alex Shi <alex.shi@intel.com>
Reviewed-by: Tim Chen <tim.c.chen@linux.intel.com>
Acked-by: Mel Gorman <mgorman@suse.de>
Tested-by: Pádraig Brady <P@draigBrady.com>
Cc: Rik van Riel <riel@redhat.com>
Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
											
										 
											2011-10-31 17:08:39 -07:00
										 |  |  | 	balanced_classzone_idx = classzone_idx; | 
					
						
							| 
									
										
										
										
											2005-04-16 15:20:36 -07:00
										 |  |  | 	for ( ; ; ) { | 
					
						
							| 
									
										
										
										
											2012-12-11 16:02:48 -08:00
										 |  |  | 		bool ret; | 
					
						
							| 
									
										
										
										
											2005-06-24 23:13:50 -07:00
										 |  |  | 
 | 
					
						
							| 
									
										
											  
											
												mm: vmscan: only read new_classzone_idx from pgdat when reclaiming successfully
During allocator-intensive workloads, kswapd will be woken frequently
causing free memory to oscillate between the high and min watermark.  This
is expected behaviour.  Unfortunately, if the highest zone is small, a
problem occurs.
When balance_pgdat() returns, it may be at a lower classzone_idx than it
started because the highest zone was unreclaimable.  Before checking if it
should go to sleep though, it checks pgdat->classzone_idx which when there
is no other activity will be MAX_NR_ZONES-1.  It interprets this as it has
been woken up while reclaiming, skips scheduling and reclaims again.  As
there is no useful reclaim work to do, it enters into a loop of shrinking
slab consuming loads of CPU until the highest zone becomes reclaimable for
a long period of time.
There are two problems here.  1) If the returned classzone or order is
lower, it'll continue reclaiming without scheduling.  2) if the highest
zone was marked unreclaimable but balance_pgdat() returns immediately at
DEF_PRIORITY, the new lower classzone is not communicated back to kswapd()
for sleeping.
This patch does two things that are related.  If the end_zone is
unreclaimable, this information is communicated back.  Second, if the
classzone or order was reduced due to failing to reclaim, new information
is not read from pgdat and instead an attempt is made to go to sleep.  Due
to this, it is also necessary that pgdat->classzone_idx be initialised
each time to pgdat->nr_zones - 1 to avoid re-reads being interpreted as
wakeups.
Signed-off-by: Mel Gorman <mgorman@suse.de>
Reported-by: Pádraig Brady <P@draigBrady.com>
Tested-by: Pádraig Brady <P@draigBrady.com>
Tested-by: Andrew Lutomirski <luto@mit.edu>
Acked-by: Rik van Riel <riel@redhat.com>
Cc: Minchan Kim <minchan.kim@gmail.com>
Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: <stable@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
											
										 
											2011-07-08 15:39:40 -07:00
										 |  |  | 		/*
 | 
					
						
							|  |  |  | 		 * If the last balance_pgdat was unsuccessful it's unlikely a | 
					
						
							|  |  |  | 		 * new request of a similar or harder type will succeed soon | 
					
						
							|  |  |  | 		 * so consider going to sleep on the basis we reclaimed at | 
					
						
							|  |  |  | 		 */ | 
					
						
							| 
									
										
											  
											
												kswapd: avoid unnecessary rebalance after an unsuccessful balancing
In commit 215ddd66 ("mm: vmscan: only read new_classzone_idx from pgdat
when reclaiming successfully") , Mel Gorman said kswapd is better to sleep
after a unsuccessful balancing if there is tighter reclaim request pending
in the balancing.  But in the following scenario, kswapd do something that
is not matched our expectation.  The patch fixes this issue.
1, Read pgdat request A (classzone_idx, order = 3)
2, balance_pgdat()
3, During pgdat, a new pgdat request B (classzone_idx, order = 5) is placed
4, balance_pgdat() returns but failed since returned order = 0
5, pgdat of request A assigned to balance_pgdat(), and do balancing again.
   While the expectation behavior of kswapd should try to sleep.
Signed-off-by: Alex Shi <alex.shi@intel.com>
Reviewed-by: Tim Chen <tim.c.chen@linux.intel.com>
Acked-by: Mel Gorman <mgorman@suse.de>
Tested-by: Pádraig Brady <P@draigBrady.com>
Cc: Rik van Riel <riel@redhat.com>
Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
											
										 
											2011-10-31 17:08:39 -07:00
										 |  |  | 		if (balanced_classzone_idx >= new_classzone_idx && | 
					
						
							|  |  |  | 					balanced_order == new_order) { | 
					
						
							| 
									
										
											  
											
												mm: vmscan: only read new_classzone_idx from pgdat when reclaiming successfully
During allocator-intensive workloads, kswapd will be woken frequently
causing free memory to oscillate between the high and min watermark.  This
is expected behaviour.  Unfortunately, if the highest zone is small, a
problem occurs.
When balance_pgdat() returns, it may be at a lower classzone_idx than it
started because the highest zone was unreclaimable.  Before checking if it
should go to sleep though, it checks pgdat->classzone_idx which when there
is no other activity will be MAX_NR_ZONES-1.  It interprets this as it has
been woken up while reclaiming, skips scheduling and reclaims again.  As
there is no useful reclaim work to do, it enters into a loop of shrinking
slab consuming loads of CPU until the highest zone becomes reclaimable for
a long period of time.
There are two problems here.  1) If the returned classzone or order is
lower, it'll continue reclaiming without scheduling.  2) if the highest
zone was marked unreclaimable but balance_pgdat() returns immediately at
DEF_PRIORITY, the new lower classzone is not communicated back to kswapd()
for sleeping.
This patch does two things that are related.  If the end_zone is
unreclaimable, this information is communicated back.  Second, if the
classzone or order was reduced due to failing to reclaim, new information
is not read from pgdat and instead an attempt is made to go to sleep.  Due
to this, it is also necessary that pgdat->classzone_idx be initialised
each time to pgdat->nr_zones - 1 to avoid re-reads being interpreted as
wakeups.
Signed-off-by: Mel Gorman <mgorman@suse.de>
Reported-by: Pádraig Brady <P@draigBrady.com>
Tested-by: Pádraig Brady <P@draigBrady.com>
Tested-by: Andrew Lutomirski <luto@mit.edu>
Acked-by: Rik van Riel <riel@redhat.com>
Cc: Minchan Kim <minchan.kim@gmail.com>
Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: <stable@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
											
										 
											2011-07-08 15:39:40 -07:00
										 |  |  | 			new_order = pgdat->kswapd_max_order; | 
					
						
							|  |  |  | 			new_classzone_idx = pgdat->classzone_idx; | 
					
						
							|  |  |  | 			pgdat->kswapd_max_order =  0; | 
					
						
							|  |  |  | 			pgdat->classzone_idx = pgdat->nr_zones - 1; | 
					
						
							|  |  |  | 		} | 
					
						
							|  |  |  | 
 | 
					
						
							| 
									
										
											  
											
												mm: kswapd: stop high-order balancing when any suitable zone is balanced
Simon Kirby reported the following problem
   We're seeing cases on a number of servers where cache never fully
   grows to use all available memory.  Sometimes we see servers with 4 GB
   of memory that never seem to have less than 1.5 GB free, even with a
   constantly-active VM.  In some cases, these servers also swap out while
   this happens, even though they are constantly reading the working set
   into memory.  We have been seeing this happening for a long time; I
   don't think it's anything recent, and it still happens on 2.6.36.
After some debugging work by Simon, Dave Hansen and others, the prevaling
theory became that kswapd is reclaiming order-3 pages requested by SLUB
too aggressive about it.
There are two apparent problems here.  On the target machine, there is a
small Normal zone in comparison to DMA32.  As kswapd tries to balance all
zones, it would continually try reclaiming for Normal even though DMA32
was balanced enough for callers.  The second problem is that
sleeping_prematurely() does not use the same logic as balance_pgdat() when
deciding whether to sleep or not.  This keeps kswapd artifically awake.
A number of tests were run and the figures from previous postings will
look very different for a few reasons.  One, the old figures were forcing
my network card to use GFP_ATOMIC in attempt to replicate Simon's problem.
 Second, I previous specified slub_min_order=3 again in an attempt to
reproduce Simon's problem.  In this posting, I'm depending on Simon to say
whether his problem is fixed or not and these figures are to show the
impact to the ordinary cases.  Finally, the "vmscan" figures are taken
from /proc/vmstat instead of the tracepoints.  There is less information
but recording is less disruptive.
The first test of relevance was postmark with a process running in the
background reading a large amount of anonymous memory in blocks.  The
objective was to vaguely simulate what was happening on Simon's machine
and it's memory intensive enough to have kswapd awake.
POSTMARK
                                            traceonly          kanyzone
Transactions per second:              156.00 ( 0.00%)   153.00 (-1.96%)
Data megabytes read per second:        21.51 ( 0.00%)    21.52 ( 0.05%)
Data megabytes written per second:     29.28 ( 0.00%)    29.11 (-0.58%)
Files created alone per second:       250.00 ( 0.00%)   416.00 (39.90%)
Files create/transact per second:      79.00 ( 0.00%)    76.00 (-3.95%)
Files deleted alone per second:       520.00 ( 0.00%)   420.00 (-23.81%)
Files delete/transact per second:      79.00 ( 0.00%)    76.00 (-3.95%)
MMTests Statistics: duration
User/Sys Time Running Test (seconds)         16.58      17.4
Total Elapsed Time (seconds)                218.48    222.47
VMstat Reclaim Statistics: vmscan
Direct reclaims                                  0          4
Direct reclaim pages scanned                     0        203
Direct reclaim pages reclaimed                   0        184
Kswapd pages scanned                        326631     322018
Kswapd pages reclaimed                      312632     309784
Kswapd low wmark quickly                         1          4
Kswapd high wmark quickly                      122        475
Kswapd skip congestion_wait                      1          0
Pages activated                             700040     705317
Pages deactivated                           212113     203922
Pages written                                 9875       6363
Total pages scanned                         326631    322221
Total pages reclaimed                       312632    309968
%age total pages scanned/reclaimed          95.71%    96.20%
%age total pages scanned/written             3.02%     1.97%
proc vmstat: Faults
Major Faults                                   300       254
Minor Faults                                645183    660284
Page ins                                    493588    486704
Page outs                                  4960088   4986704
Swap ins                                      1230       661
Swap outs                                     9869      6355
Performance is mildly affected because kswapd is no longer doing as much
work and the background memory consumer process is getting in the way.
Note that kswapd scanned and reclaimed fewer pages as it's less aggressive
and overall fewer pages were scanned and reclaimed.  Swap in/out is
particularly reduced again reflecting kswapd throwing out fewer pages.
The slight performance impact is unfortunate here but it looks like a
direct result of kswapd being less aggressive.  As the bug report is about
too many pages being freed by kswapd, it may have to be accepted for now.
The second test is a streaming IO benchmark that was previously used by
Johannes to show regressions in page reclaim.
MICRO
					 traceonly  kanyzone
User/Sys Time Running Test (seconds)         29.29     28.87
Total Elapsed Time (seconds)                492.18    488.79
VMstat Reclaim Statistics: vmscan
Direct reclaims                               2128       1460
Direct reclaim pages scanned               2284822    1496067
Direct reclaim pages reclaimed              148919     110937
Kswapd pages scanned                      15450014   16202876
Kswapd pages reclaimed                     8503697    8537897
Kswapd low wmark quickly                      3100       3397
Kswapd high wmark quickly                     1860       7243
Kswapd skip congestion_wait                    708        801
Pages activated                               9635       9573
Pages deactivated                             1432       1271
Pages written                                  223       1130
Total pages scanned                       17734836  17698943
Total pages reclaimed                      8652616   8648834
%age total pages scanned/reclaimed          48.79%    48.87%
%age total pages scanned/written             0.00%     0.01%
proc vmstat: Faults
Major Faults                                   165       221
Minor Faults                               9655785   9656506
Page ins                                      3880      7228
Page outs                                 37692940  37480076
Swap ins                                         0        69
Swap outs                                       19        15
Again fewer pages are scanned and reclaimed as expected and this time the
test completed faster.  Note that kswapd is hitting its watermarks faster
(low and high wmark quickly) which I expect is due to kswapd reclaiming
fewer pages.
I also ran fs-mark, iozone and sysbench but there is nothing interesting
to report in the figures.  Performance is not significantly changed and
the reclaim statistics look reasonable.
Tgis patch:
When the allocator enters its slow path, kswapd is woken up to balance the
node.  It continues working until all zones within the node are balanced.
For order-0 allocations, this makes perfect sense but for higher orders it
can have unintended side-effects.  If the zone sizes are imbalanced,
kswapd may reclaim heavily within a smaller zone discarding an excessive
number of pages.  The user-visible behaviour is that kswapd is awake and
reclaiming even though plenty of pages are free from a suitable zone.
This patch alters the "balance" logic for high-order reclaim allowing
kswapd to stop if any suitable zone becomes balanced to reduce the number
of pages it reclaims from other zones.  kswapd still tries to ensure that
order-0 watermarks for all zones are met before sleeping.
Signed-off-by: Mel Gorman <mel@csn.ul.ie>
Reviewed-by: Minchan Kim <minchan.kim@gmail.com>
Reviewed-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Reviewed-by: Eric B Munson <emunson@mgebm.net>
Cc: Simon Kirby <sim@hostway.ca>
Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Cc: Shaohua Li <shaohua.li@intel.com>
Cc: Dave Hansen <dave@linux.vnet.ibm.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Rik van Riel <riel@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
											
										 
											2011-01-13 15:46:20 -08:00
										 |  |  | 		if (order < new_order || classzone_idx > new_classzone_idx) { | 
					
						
							| 
									
										
										
										
											2005-04-16 15:20:36 -07:00
										 |  |  | 			/*
 | 
					
						
							|  |  |  | 			 * Don't sleep if someone wants a larger 'order' | 
					
						
							| 
									
										
											  
											
												mm: kswapd: stop high-order balancing when any suitable zone is balanced
Simon Kirby reported the following problem
   We're seeing cases on a number of servers where cache never fully
   grows to use all available memory.  Sometimes we see servers with 4 GB
   of memory that never seem to have less than 1.5 GB free, even with a
   constantly-active VM.  In some cases, these servers also swap out while
   this happens, even though they are constantly reading the working set
   into memory.  We have been seeing this happening for a long time; I
   don't think it's anything recent, and it still happens on 2.6.36.
After some debugging work by Simon, Dave Hansen and others, the prevaling
theory became that kswapd is reclaiming order-3 pages requested by SLUB
too aggressive about it.
There are two apparent problems here.  On the target machine, there is a
small Normal zone in comparison to DMA32.  As kswapd tries to balance all
zones, it would continually try reclaiming for Normal even though DMA32
was balanced enough for callers.  The second problem is that
sleeping_prematurely() does not use the same logic as balance_pgdat() when
deciding whether to sleep or not.  This keeps kswapd artifically awake.
A number of tests were run and the figures from previous postings will
look very different for a few reasons.  One, the old figures were forcing
my network card to use GFP_ATOMIC in attempt to replicate Simon's problem.
 Second, I previous specified slub_min_order=3 again in an attempt to
reproduce Simon's problem.  In this posting, I'm depending on Simon to say
whether his problem is fixed or not and these figures are to show the
impact to the ordinary cases.  Finally, the "vmscan" figures are taken
from /proc/vmstat instead of the tracepoints.  There is less information
but recording is less disruptive.
The first test of relevance was postmark with a process running in the
background reading a large amount of anonymous memory in blocks.  The
objective was to vaguely simulate what was happening on Simon's machine
and it's memory intensive enough to have kswapd awake.
POSTMARK
                                            traceonly          kanyzone
Transactions per second:              156.00 ( 0.00%)   153.00 (-1.96%)
Data megabytes read per second:        21.51 ( 0.00%)    21.52 ( 0.05%)
Data megabytes written per second:     29.28 ( 0.00%)    29.11 (-0.58%)
Files created alone per second:       250.00 ( 0.00%)   416.00 (39.90%)
Files create/transact per second:      79.00 ( 0.00%)    76.00 (-3.95%)
Files deleted alone per second:       520.00 ( 0.00%)   420.00 (-23.81%)
Files delete/transact per second:      79.00 ( 0.00%)    76.00 (-3.95%)
MMTests Statistics: duration
User/Sys Time Running Test (seconds)         16.58      17.4
Total Elapsed Time (seconds)                218.48    222.47
VMstat Reclaim Statistics: vmscan
Direct reclaims                                  0          4
Direct reclaim pages scanned                     0        203
Direct reclaim pages reclaimed                   0        184
Kswapd pages scanned                        326631     322018
Kswapd pages reclaimed                      312632     309784
Kswapd low wmark quickly                         1          4
Kswapd high wmark quickly                      122        475
Kswapd skip congestion_wait                      1          0
Pages activated                             700040     705317
Pages deactivated                           212113     203922
Pages written                                 9875       6363
Total pages scanned                         326631    322221
Total pages reclaimed                       312632    309968
%age total pages scanned/reclaimed          95.71%    96.20%
%age total pages scanned/written             3.02%     1.97%
proc vmstat: Faults
Major Faults                                   300       254
Minor Faults                                645183    660284
Page ins                                    493588    486704
Page outs                                  4960088   4986704
Swap ins                                      1230       661
Swap outs                                     9869      6355
Performance is mildly affected because kswapd is no longer doing as much
work and the background memory consumer process is getting in the way.
Note that kswapd scanned and reclaimed fewer pages as it's less aggressive
and overall fewer pages were scanned and reclaimed.  Swap in/out is
particularly reduced again reflecting kswapd throwing out fewer pages.
The slight performance impact is unfortunate here but it looks like a
direct result of kswapd being less aggressive.  As the bug report is about
too many pages being freed by kswapd, it may have to be accepted for now.
The second test is a streaming IO benchmark that was previously used by
Johannes to show regressions in page reclaim.
MICRO
					 traceonly  kanyzone
User/Sys Time Running Test (seconds)         29.29     28.87
Total Elapsed Time (seconds)                492.18    488.79
VMstat Reclaim Statistics: vmscan
Direct reclaims                               2128       1460
Direct reclaim pages scanned               2284822    1496067
Direct reclaim pages reclaimed              148919     110937
Kswapd pages scanned                      15450014   16202876
Kswapd pages reclaimed                     8503697    8537897
Kswapd low wmark quickly                      3100       3397
Kswapd high wmark quickly                     1860       7243
Kswapd skip congestion_wait                    708        801
Pages activated                               9635       9573
Pages deactivated                             1432       1271
Pages written                                  223       1130
Total pages scanned                       17734836  17698943
Total pages reclaimed                      8652616   8648834
%age total pages scanned/reclaimed          48.79%    48.87%
%age total pages scanned/written             0.00%     0.01%
proc vmstat: Faults
Major Faults                                   165       221
Minor Faults                               9655785   9656506
Page ins                                      3880      7228
Page outs                                 37692940  37480076
Swap ins                                         0        69
Swap outs                                       19        15
Again fewer pages are scanned and reclaimed as expected and this time the
test completed faster.  Note that kswapd is hitting its watermarks faster
(low and high wmark quickly) which I expect is due to kswapd reclaiming
fewer pages.
I also ran fs-mark, iozone and sysbench but there is nothing interesting
to report in the figures.  Performance is not significantly changed and
the reclaim statistics look reasonable.
Tgis patch:
When the allocator enters its slow path, kswapd is woken up to balance the
node.  It continues working until all zones within the node are balanced.
For order-0 allocations, this makes perfect sense but for higher orders it
can have unintended side-effects.  If the zone sizes are imbalanced,
kswapd may reclaim heavily within a smaller zone discarding an excessive
number of pages.  The user-visible behaviour is that kswapd is awake and
reclaiming even though plenty of pages are free from a suitable zone.
This patch alters the "balance" logic for high-order reclaim allowing
kswapd to stop if any suitable zone becomes balanced to reduce the number
of pages it reclaims from other zones.  kswapd still tries to ensure that
order-0 watermarks for all zones are met before sleeping.
Signed-off-by: Mel Gorman <mel@csn.ul.ie>
Reviewed-by: Minchan Kim <minchan.kim@gmail.com>
Reviewed-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Reviewed-by: Eric B Munson <emunson@mgebm.net>
Cc: Simon Kirby <sim@hostway.ca>
Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Cc: Shaohua Li <shaohua.li@intel.com>
Cc: Dave Hansen <dave@linux.vnet.ibm.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Rik van Riel <riel@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
											
										 
											2011-01-13 15:46:20 -08:00
										 |  |  | 			 * allocation or has tigher zone constraints | 
					
						
							| 
									
										
										
										
											2005-04-16 15:20:36 -07:00
										 |  |  | 			 */ | 
					
						
							|  |  |  | 			order = new_order; | 
					
						
							| 
									
										
											  
											
												mm: kswapd: stop high-order balancing when any suitable zone is balanced
Simon Kirby reported the following problem
   We're seeing cases on a number of servers where cache never fully
   grows to use all available memory.  Sometimes we see servers with 4 GB
   of memory that never seem to have less than 1.5 GB free, even with a
   constantly-active VM.  In some cases, these servers also swap out while
   this happens, even though they are constantly reading the working set
   into memory.  We have been seeing this happening for a long time; I
   don't think it's anything recent, and it still happens on 2.6.36.
After some debugging work by Simon, Dave Hansen and others, the prevaling
theory became that kswapd is reclaiming order-3 pages requested by SLUB
too aggressive about it.
There are two apparent problems here.  On the target machine, there is a
small Normal zone in comparison to DMA32.  As kswapd tries to balance all
zones, it would continually try reclaiming for Normal even though DMA32
was balanced enough for callers.  The second problem is that
sleeping_prematurely() does not use the same logic as balance_pgdat() when
deciding whether to sleep or not.  This keeps kswapd artifically awake.
A number of tests were run and the figures from previous postings will
look very different for a few reasons.  One, the old figures were forcing
my network card to use GFP_ATOMIC in attempt to replicate Simon's problem.
 Second, I previous specified slub_min_order=3 again in an attempt to
reproduce Simon's problem.  In this posting, I'm depending on Simon to say
whether his problem is fixed or not and these figures are to show the
impact to the ordinary cases.  Finally, the "vmscan" figures are taken
from /proc/vmstat instead of the tracepoints.  There is less information
but recording is less disruptive.
The first test of relevance was postmark with a process running in the
background reading a large amount of anonymous memory in blocks.  The
objective was to vaguely simulate what was happening on Simon's machine
and it's memory intensive enough to have kswapd awake.
POSTMARK
                                            traceonly          kanyzone
Transactions per second:              156.00 ( 0.00%)   153.00 (-1.96%)
Data megabytes read per second:        21.51 ( 0.00%)    21.52 ( 0.05%)
Data megabytes written per second:     29.28 ( 0.00%)    29.11 (-0.58%)
Files created alone per second:       250.00 ( 0.00%)   416.00 (39.90%)
Files create/transact per second:      79.00 ( 0.00%)    76.00 (-3.95%)
Files deleted alone per second:       520.00 ( 0.00%)   420.00 (-23.81%)
Files delete/transact per second:      79.00 ( 0.00%)    76.00 (-3.95%)
MMTests Statistics: duration
User/Sys Time Running Test (seconds)         16.58      17.4
Total Elapsed Time (seconds)                218.48    222.47
VMstat Reclaim Statistics: vmscan
Direct reclaims                                  0          4
Direct reclaim pages scanned                     0        203
Direct reclaim pages reclaimed                   0        184
Kswapd pages scanned                        326631     322018
Kswapd pages reclaimed                      312632     309784
Kswapd low wmark quickly                         1          4
Kswapd high wmark quickly                      122        475
Kswapd skip congestion_wait                      1          0
Pages activated                             700040     705317
Pages deactivated                           212113     203922
Pages written                                 9875       6363
Total pages scanned                         326631    322221
Total pages reclaimed                       312632    309968
%age total pages scanned/reclaimed          95.71%    96.20%
%age total pages scanned/written             3.02%     1.97%
proc vmstat: Faults
Major Faults                                   300       254
Minor Faults                                645183    660284
Page ins                                    493588    486704
Page outs                                  4960088   4986704
Swap ins                                      1230       661
Swap outs                                     9869      6355
Performance is mildly affected because kswapd is no longer doing as much
work and the background memory consumer process is getting in the way.
Note that kswapd scanned and reclaimed fewer pages as it's less aggressive
and overall fewer pages were scanned and reclaimed.  Swap in/out is
particularly reduced again reflecting kswapd throwing out fewer pages.
The slight performance impact is unfortunate here but it looks like a
direct result of kswapd being less aggressive.  As the bug report is about
too many pages being freed by kswapd, it may have to be accepted for now.
The second test is a streaming IO benchmark that was previously used by
Johannes to show regressions in page reclaim.
MICRO
					 traceonly  kanyzone
User/Sys Time Running Test (seconds)         29.29     28.87
Total Elapsed Time (seconds)                492.18    488.79
VMstat Reclaim Statistics: vmscan
Direct reclaims                               2128       1460
Direct reclaim pages scanned               2284822    1496067
Direct reclaim pages reclaimed              148919     110937
Kswapd pages scanned                      15450014   16202876
Kswapd pages reclaimed                     8503697    8537897
Kswapd low wmark quickly                      3100       3397
Kswapd high wmark quickly                     1860       7243
Kswapd skip congestion_wait                    708        801
Pages activated                               9635       9573
Pages deactivated                             1432       1271
Pages written                                  223       1130
Total pages scanned                       17734836  17698943
Total pages reclaimed                      8652616   8648834
%age total pages scanned/reclaimed          48.79%    48.87%
%age total pages scanned/written             0.00%     0.01%
proc vmstat: Faults
Major Faults                                   165       221
Minor Faults                               9655785   9656506
Page ins                                      3880      7228
Page outs                                 37692940  37480076
Swap ins                                         0        69
Swap outs                                       19        15
Again fewer pages are scanned and reclaimed as expected and this time the
test completed faster.  Note that kswapd is hitting its watermarks faster
(low and high wmark quickly) which I expect is due to kswapd reclaiming
fewer pages.
I also ran fs-mark, iozone and sysbench but there is nothing interesting
to report in the figures.  Performance is not significantly changed and
the reclaim statistics look reasonable.
Tgis patch:
When the allocator enters its slow path, kswapd is woken up to balance the
node.  It continues working until all zones within the node are balanced.
For order-0 allocations, this makes perfect sense but for higher orders it
can have unintended side-effects.  If the zone sizes are imbalanced,
kswapd may reclaim heavily within a smaller zone discarding an excessive
number of pages.  The user-visible behaviour is that kswapd is awake and
reclaiming even though plenty of pages are free from a suitable zone.
This patch alters the "balance" logic for high-order reclaim allowing
kswapd to stop if any suitable zone becomes balanced to reduce the number
of pages it reclaims from other zones.  kswapd still tries to ensure that
order-0 watermarks for all zones are met before sleeping.
Signed-off-by: Mel Gorman <mel@csn.ul.ie>
Reviewed-by: Minchan Kim <minchan.kim@gmail.com>
Reviewed-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Reviewed-by: Eric B Munson <emunson@mgebm.net>
Cc: Simon Kirby <sim@hostway.ca>
Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Cc: Shaohua Li <shaohua.li@intel.com>
Cc: Dave Hansen <dave@linux.vnet.ibm.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Rik van Riel <riel@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
											
										 
											2011-01-13 15:46:20 -08:00
										 |  |  | 			classzone_idx = new_classzone_idx; | 
					
						
							| 
									
										
										
										
											2005-04-16 15:20:36 -07:00
										 |  |  | 		} else { | 
					
						
							| 
									
										
											  
											
												kswapd: avoid unnecessary rebalance after an unsuccessful balancing
In commit 215ddd66 ("mm: vmscan: only read new_classzone_idx from pgdat
when reclaiming successfully") , Mel Gorman said kswapd is better to sleep
after a unsuccessful balancing if there is tighter reclaim request pending
in the balancing.  But in the following scenario, kswapd do something that
is not matched our expectation.  The patch fixes this issue.
1, Read pgdat request A (classzone_idx, order = 3)
2, balance_pgdat()
3, During pgdat, a new pgdat request B (classzone_idx, order = 5) is placed
4, balance_pgdat() returns but failed since returned order = 0
5, pgdat of request A assigned to balance_pgdat(), and do balancing again.
   While the expectation behavior of kswapd should try to sleep.
Signed-off-by: Alex Shi <alex.shi@intel.com>
Reviewed-by: Tim Chen <tim.c.chen@linux.intel.com>
Acked-by: Mel Gorman <mgorman@suse.de>
Tested-by: Pádraig Brady <P@draigBrady.com>
Cc: Rik van Riel <riel@redhat.com>
Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
											
										 
											2011-10-31 17:08:39 -07:00
										 |  |  | 			kswapd_try_to_sleep(pgdat, balanced_order, | 
					
						
							|  |  |  | 						balanced_classzone_idx); | 
					
						
							| 
									
										
										
										
											2005-04-16 15:20:36 -07:00
										 |  |  | 			order = pgdat->kswapd_max_order; | 
					
						
							| 
									
										
											  
											
												mm: kswapd: stop high-order balancing when any suitable zone is balanced
Simon Kirby reported the following problem
   We're seeing cases on a number of servers where cache never fully
   grows to use all available memory.  Sometimes we see servers with 4 GB
   of memory that never seem to have less than 1.5 GB free, even with a
   constantly-active VM.  In some cases, these servers also swap out while
   this happens, even though they are constantly reading the working set
   into memory.  We have been seeing this happening for a long time; I
   don't think it's anything recent, and it still happens on 2.6.36.
After some debugging work by Simon, Dave Hansen and others, the prevaling
theory became that kswapd is reclaiming order-3 pages requested by SLUB
too aggressive about it.
There are two apparent problems here.  On the target machine, there is a
small Normal zone in comparison to DMA32.  As kswapd tries to balance all
zones, it would continually try reclaiming for Normal even though DMA32
was balanced enough for callers.  The second problem is that
sleeping_prematurely() does not use the same logic as balance_pgdat() when
deciding whether to sleep or not.  This keeps kswapd artifically awake.
A number of tests were run and the figures from previous postings will
look very different for a few reasons.  One, the old figures were forcing
my network card to use GFP_ATOMIC in attempt to replicate Simon's problem.
 Second, I previous specified slub_min_order=3 again in an attempt to
reproduce Simon's problem.  In this posting, I'm depending on Simon to say
whether his problem is fixed or not and these figures are to show the
impact to the ordinary cases.  Finally, the "vmscan" figures are taken
from /proc/vmstat instead of the tracepoints.  There is less information
but recording is less disruptive.
The first test of relevance was postmark with a process running in the
background reading a large amount of anonymous memory in blocks.  The
objective was to vaguely simulate what was happening on Simon's machine
and it's memory intensive enough to have kswapd awake.
POSTMARK
                                            traceonly          kanyzone
Transactions per second:              156.00 ( 0.00%)   153.00 (-1.96%)
Data megabytes read per second:        21.51 ( 0.00%)    21.52 ( 0.05%)
Data megabytes written per second:     29.28 ( 0.00%)    29.11 (-0.58%)
Files created alone per second:       250.00 ( 0.00%)   416.00 (39.90%)
Files create/transact per second:      79.00 ( 0.00%)    76.00 (-3.95%)
Files deleted alone per second:       520.00 ( 0.00%)   420.00 (-23.81%)
Files delete/transact per second:      79.00 ( 0.00%)    76.00 (-3.95%)
MMTests Statistics: duration
User/Sys Time Running Test (seconds)         16.58      17.4
Total Elapsed Time (seconds)                218.48    222.47
VMstat Reclaim Statistics: vmscan
Direct reclaims                                  0          4
Direct reclaim pages scanned                     0        203
Direct reclaim pages reclaimed                   0        184
Kswapd pages scanned                        326631     322018
Kswapd pages reclaimed                      312632     309784
Kswapd low wmark quickly                         1          4
Kswapd high wmark quickly                      122        475
Kswapd skip congestion_wait                      1          0
Pages activated                             700040     705317
Pages deactivated                           212113     203922
Pages written                                 9875       6363
Total pages scanned                         326631    322221
Total pages reclaimed                       312632    309968
%age total pages scanned/reclaimed          95.71%    96.20%
%age total pages scanned/written             3.02%     1.97%
proc vmstat: Faults
Major Faults                                   300       254
Minor Faults                                645183    660284
Page ins                                    493588    486704
Page outs                                  4960088   4986704
Swap ins                                      1230       661
Swap outs                                     9869      6355
Performance is mildly affected because kswapd is no longer doing as much
work and the background memory consumer process is getting in the way.
Note that kswapd scanned and reclaimed fewer pages as it's less aggressive
and overall fewer pages were scanned and reclaimed.  Swap in/out is
particularly reduced again reflecting kswapd throwing out fewer pages.
The slight performance impact is unfortunate here but it looks like a
direct result of kswapd being less aggressive.  As the bug report is about
too many pages being freed by kswapd, it may have to be accepted for now.
The second test is a streaming IO benchmark that was previously used by
Johannes to show regressions in page reclaim.
MICRO
					 traceonly  kanyzone
User/Sys Time Running Test (seconds)         29.29     28.87
Total Elapsed Time (seconds)                492.18    488.79
VMstat Reclaim Statistics: vmscan
Direct reclaims                               2128       1460
Direct reclaim pages scanned               2284822    1496067
Direct reclaim pages reclaimed              148919     110937
Kswapd pages scanned                      15450014   16202876
Kswapd pages reclaimed                     8503697    8537897
Kswapd low wmark quickly                      3100       3397
Kswapd high wmark quickly                     1860       7243
Kswapd skip congestion_wait                    708        801
Pages activated                               9635       9573
Pages deactivated                             1432       1271
Pages written                                  223       1130
Total pages scanned                       17734836  17698943
Total pages reclaimed                      8652616   8648834
%age total pages scanned/reclaimed          48.79%    48.87%
%age total pages scanned/written             0.00%     0.01%
proc vmstat: Faults
Major Faults                                   165       221
Minor Faults                               9655785   9656506
Page ins                                      3880      7228
Page outs                                 37692940  37480076
Swap ins                                         0        69
Swap outs                                       19        15
Again fewer pages are scanned and reclaimed as expected and this time the
test completed faster.  Note that kswapd is hitting its watermarks faster
(low and high wmark quickly) which I expect is due to kswapd reclaiming
fewer pages.
I also ran fs-mark, iozone and sysbench but there is nothing interesting
to report in the figures.  Performance is not significantly changed and
the reclaim statistics look reasonable.
Tgis patch:
When the allocator enters its slow path, kswapd is woken up to balance the
node.  It continues working until all zones within the node are balanced.
For order-0 allocations, this makes perfect sense but for higher orders it
can have unintended side-effects.  If the zone sizes are imbalanced,
kswapd may reclaim heavily within a smaller zone discarding an excessive
number of pages.  The user-visible behaviour is that kswapd is awake and
reclaiming even though plenty of pages are free from a suitable zone.
This patch alters the "balance" logic for high-order reclaim allowing
kswapd to stop if any suitable zone becomes balanced to reduce the number
of pages it reclaims from other zones.  kswapd still tries to ensure that
order-0 watermarks for all zones are met before sleeping.
Signed-off-by: Mel Gorman <mel@csn.ul.ie>
Reviewed-by: Minchan Kim <minchan.kim@gmail.com>
Reviewed-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Reviewed-by: Eric B Munson <emunson@mgebm.net>
Cc: Simon Kirby <sim@hostway.ca>
Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Cc: Shaohua Li <shaohua.li@intel.com>
Cc: Dave Hansen <dave@linux.vnet.ibm.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Rik van Riel <riel@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
											
										 
											2011-01-13 15:46:20 -08:00
										 |  |  | 			classzone_idx = pgdat->classzone_idx; | 
					
						
							| 
									
										
											  
											
												kswapd: assign new_order and new_classzone_idx after wakeup in sleeping
There 2 places to read pgdat in kswapd.  One is return from a successful
balance, another is waked up from kswapd sleeping.  The new_order and
new_classzone_idx represent the balance input order and classzone_idx.
But current new_order and new_classzone_idx are not assigned after
kswapd_try_to_sleep(), that will cause a bug in the following scenario.
1: after a successful balance, kswapd goes to sleep, and new_order = 0;
   new_classzone_idx = __MAX_NR_ZONES - 1;
2: kswapd waked up with order = 3 and classzone_idx = ZONE_NORMAL
3: in the balance_pgdat() running, a new balance wakeup happened with
   order = 5, and classzone_idx = ZONE_NORMAL
4: the first wakeup(order = 3) finished successufly, return order = 3
   but, the new_order is still 0, so, this balancing will be treated as a
   failed balance.  And then the second tighter balancing will be missed.
So, to avoid the above problem, the new_order and new_classzone_idx need
to be assigned for later successful comparison.
Signed-off-by: Alex Shi <alex.shi@intel.com>
Acked-by: Mel Gorman <mgorman@suse.de>
Reviewed-by: Minchan Kim <minchan.kim@gmail.com>
Tested-by: Pádraig Brady <P@draigBrady.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
											
										 
											2011-10-31 17:08:45 -07:00
										 |  |  | 			new_order = order; | 
					
						
							|  |  |  | 			new_classzone_idx = classzone_idx; | 
					
						
							| 
									
										
										
										
											2011-01-13 15:46:23 -08:00
										 |  |  | 			pgdat->kswapd_max_order = 0; | 
					
						
							| 
									
										
											  
											
												mm: vmscan: only read new_classzone_idx from pgdat when reclaiming successfully
During allocator-intensive workloads, kswapd will be woken frequently
causing free memory to oscillate between the high and min watermark.  This
is expected behaviour.  Unfortunately, if the highest zone is small, a
problem occurs.
When balance_pgdat() returns, it may be at a lower classzone_idx than it
started because the highest zone was unreclaimable.  Before checking if it
should go to sleep though, it checks pgdat->classzone_idx which when there
is no other activity will be MAX_NR_ZONES-1.  It interprets this as it has
been woken up while reclaiming, skips scheduling and reclaims again.  As
there is no useful reclaim work to do, it enters into a loop of shrinking
slab consuming loads of CPU until the highest zone becomes reclaimable for
a long period of time.
There are two problems here.  1) If the returned classzone or order is
lower, it'll continue reclaiming without scheduling.  2) if the highest
zone was marked unreclaimable but balance_pgdat() returns immediately at
DEF_PRIORITY, the new lower classzone is not communicated back to kswapd()
for sleeping.
This patch does two things that are related.  If the end_zone is
unreclaimable, this information is communicated back.  Second, if the
classzone or order was reduced due to failing to reclaim, new information
is not read from pgdat and instead an attempt is made to go to sleep.  Due
to this, it is also necessary that pgdat->classzone_idx be initialised
each time to pgdat->nr_zones - 1 to avoid re-reads being interpreted as
wakeups.
Signed-off-by: Mel Gorman <mgorman@suse.de>
Reported-by: Pádraig Brady <P@draigBrady.com>
Tested-by: Pádraig Brady <P@draigBrady.com>
Tested-by: Andrew Lutomirski <luto@mit.edu>
Acked-by: Rik van Riel <riel@redhat.com>
Cc: Minchan Kim <minchan.kim@gmail.com>
Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: <stable@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
											
										 
											2011-07-08 15:39:40 -07:00
										 |  |  | 			pgdat->classzone_idx = pgdat->nr_zones - 1; | 
					
						
							| 
									
										
										
										
											2005-04-16 15:20:36 -07:00
										 |  |  | 		} | 
					
						
							|  |  |  | 
 | 
					
						
							| 
									
										
										
										
											2009-12-14 17:58:33 -08:00
										 |  |  | 		ret = try_to_freeze(); | 
					
						
							|  |  |  | 		if (kthread_should_stop()) | 
					
						
							|  |  |  | 			break; | 
					
						
							|  |  |  | 
 | 
					
						
							|  |  |  | 		/*
 | 
					
						
							|  |  |  | 		 * We can speed up thawing tasks if we don't call balance_pgdat | 
					
						
							|  |  |  | 		 * after returning from the refrigerator | 
					
						
							|  |  |  | 		 */ | 
					
						
							| 
									
										
										
										
											2010-08-09 17:19:16 -07:00
										 |  |  | 		if (!ret) { | 
					
						
							|  |  |  | 			trace_mm_vmscan_kswapd_wake(pgdat->node_id, order); | 
					
						
							| 
									
										
											  
											
												kswapd: avoid unnecessary rebalance after an unsuccessful balancing
In commit 215ddd66 ("mm: vmscan: only read new_classzone_idx from pgdat
when reclaiming successfully") , Mel Gorman said kswapd is better to sleep
after a unsuccessful balancing if there is tighter reclaim request pending
in the balancing.  But in the following scenario, kswapd do something that
is not matched our expectation.  The patch fixes this issue.
1, Read pgdat request A (classzone_idx, order = 3)
2, balance_pgdat()
3, During pgdat, a new pgdat request B (classzone_idx, order = 5) is placed
4, balance_pgdat() returns but failed since returned order = 0
5, pgdat of request A assigned to balance_pgdat(), and do balancing again.
   While the expectation behavior of kswapd should try to sleep.
Signed-off-by: Alex Shi <alex.shi@intel.com>
Reviewed-by: Tim Chen <tim.c.chen@linux.intel.com>
Acked-by: Mel Gorman <mgorman@suse.de>
Tested-by: Pádraig Brady <P@draigBrady.com>
Cc: Rik van Riel <riel@redhat.com>
Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
											
										 
											2011-10-31 17:08:39 -07:00
										 |  |  | 			balanced_classzone_idx = classzone_idx; | 
					
						
							|  |  |  | 			balanced_order = balance_pgdat(pgdat, order, | 
					
						
							|  |  |  | 						&balanced_classzone_idx); | 
					
						
							| 
									
										
										
										
											2010-08-09 17:19:16 -07:00
										 |  |  | 		} | 
					
						
							| 
									
										
										
										
											2005-04-16 15:20:36 -07:00
										 |  |  | 	} | 
					
						
							| 
									
										
										
										
											2012-11-08 15:53:39 -08:00
										 |  |  | 
 | 
					
						
							|  |  |  | 	current->reclaim_state = NULL; | 
					
						
							| 
									
										
										
										
											2005-04-16 15:20:36 -07:00
										 |  |  | 	return 0; | 
					
						
							|  |  |  | } | 
					
						
							|  |  |  | 
 | 
					
						
							|  |  |  | /*
 | 
					
						
							|  |  |  |  * A zone is low on free memory, so wake its kswapd task to service it. | 
					
						
							|  |  |  |  */ | 
					
						
							| 
									
										
											  
											
												mm: kswapd: stop high-order balancing when any suitable zone is balanced
Simon Kirby reported the following problem
   We're seeing cases on a number of servers where cache never fully
   grows to use all available memory.  Sometimes we see servers with 4 GB
   of memory that never seem to have less than 1.5 GB free, even with a
   constantly-active VM.  In some cases, these servers also swap out while
   this happens, even though they are constantly reading the working set
   into memory.  We have been seeing this happening for a long time; I
   don't think it's anything recent, and it still happens on 2.6.36.
After some debugging work by Simon, Dave Hansen and others, the prevaling
theory became that kswapd is reclaiming order-3 pages requested by SLUB
too aggressive about it.
There are two apparent problems here.  On the target machine, there is a
small Normal zone in comparison to DMA32.  As kswapd tries to balance all
zones, it would continually try reclaiming for Normal even though DMA32
was balanced enough for callers.  The second problem is that
sleeping_prematurely() does not use the same logic as balance_pgdat() when
deciding whether to sleep or not.  This keeps kswapd artifically awake.
A number of tests were run and the figures from previous postings will
look very different for a few reasons.  One, the old figures were forcing
my network card to use GFP_ATOMIC in attempt to replicate Simon's problem.
 Second, I previous specified slub_min_order=3 again in an attempt to
reproduce Simon's problem.  In this posting, I'm depending on Simon to say
whether his problem is fixed or not and these figures are to show the
impact to the ordinary cases.  Finally, the "vmscan" figures are taken
from /proc/vmstat instead of the tracepoints.  There is less information
but recording is less disruptive.
The first test of relevance was postmark with a process running in the
background reading a large amount of anonymous memory in blocks.  The
objective was to vaguely simulate what was happening on Simon's machine
and it's memory intensive enough to have kswapd awake.
POSTMARK
                                            traceonly          kanyzone
Transactions per second:              156.00 ( 0.00%)   153.00 (-1.96%)
Data megabytes read per second:        21.51 ( 0.00%)    21.52 ( 0.05%)
Data megabytes written per second:     29.28 ( 0.00%)    29.11 (-0.58%)
Files created alone per second:       250.00 ( 0.00%)   416.00 (39.90%)
Files create/transact per second:      79.00 ( 0.00%)    76.00 (-3.95%)
Files deleted alone per second:       520.00 ( 0.00%)   420.00 (-23.81%)
Files delete/transact per second:      79.00 ( 0.00%)    76.00 (-3.95%)
MMTests Statistics: duration
User/Sys Time Running Test (seconds)         16.58      17.4
Total Elapsed Time (seconds)                218.48    222.47
VMstat Reclaim Statistics: vmscan
Direct reclaims                                  0          4
Direct reclaim pages scanned                     0        203
Direct reclaim pages reclaimed                   0        184
Kswapd pages scanned                        326631     322018
Kswapd pages reclaimed                      312632     309784
Kswapd low wmark quickly                         1          4
Kswapd high wmark quickly                      122        475
Kswapd skip congestion_wait                      1          0
Pages activated                             700040     705317
Pages deactivated                           212113     203922
Pages written                                 9875       6363
Total pages scanned                         326631    322221
Total pages reclaimed                       312632    309968
%age total pages scanned/reclaimed          95.71%    96.20%
%age total pages scanned/written             3.02%     1.97%
proc vmstat: Faults
Major Faults                                   300       254
Minor Faults                                645183    660284
Page ins                                    493588    486704
Page outs                                  4960088   4986704
Swap ins                                      1230       661
Swap outs                                     9869      6355
Performance is mildly affected because kswapd is no longer doing as much
work and the background memory consumer process is getting in the way.
Note that kswapd scanned and reclaimed fewer pages as it's less aggressive
and overall fewer pages were scanned and reclaimed.  Swap in/out is
particularly reduced again reflecting kswapd throwing out fewer pages.
The slight performance impact is unfortunate here but it looks like a
direct result of kswapd being less aggressive.  As the bug report is about
too many pages being freed by kswapd, it may have to be accepted for now.
The second test is a streaming IO benchmark that was previously used by
Johannes to show regressions in page reclaim.
MICRO
					 traceonly  kanyzone
User/Sys Time Running Test (seconds)         29.29     28.87
Total Elapsed Time (seconds)                492.18    488.79
VMstat Reclaim Statistics: vmscan
Direct reclaims                               2128       1460
Direct reclaim pages scanned               2284822    1496067
Direct reclaim pages reclaimed              148919     110937
Kswapd pages scanned                      15450014   16202876
Kswapd pages reclaimed                     8503697    8537897
Kswapd low wmark quickly                      3100       3397
Kswapd high wmark quickly                     1860       7243
Kswapd skip congestion_wait                    708        801
Pages activated                               9635       9573
Pages deactivated                             1432       1271
Pages written                                  223       1130
Total pages scanned                       17734836  17698943
Total pages reclaimed                      8652616   8648834
%age total pages scanned/reclaimed          48.79%    48.87%
%age total pages scanned/written             0.00%     0.01%
proc vmstat: Faults
Major Faults                                   165       221
Minor Faults                               9655785   9656506
Page ins                                      3880      7228
Page outs                                 37692940  37480076
Swap ins                                         0        69
Swap outs                                       19        15
Again fewer pages are scanned and reclaimed as expected and this time the
test completed faster.  Note that kswapd is hitting its watermarks faster
(low and high wmark quickly) which I expect is due to kswapd reclaiming
fewer pages.
I also ran fs-mark, iozone and sysbench but there is nothing interesting
to report in the figures.  Performance is not significantly changed and
the reclaim statistics look reasonable.
Tgis patch:
When the allocator enters its slow path, kswapd is woken up to balance the
node.  It continues working until all zones within the node are balanced.
For order-0 allocations, this makes perfect sense but for higher orders it
can have unintended side-effects.  If the zone sizes are imbalanced,
kswapd may reclaim heavily within a smaller zone discarding an excessive
number of pages.  The user-visible behaviour is that kswapd is awake and
reclaiming even though plenty of pages are free from a suitable zone.
This patch alters the "balance" logic for high-order reclaim allowing
kswapd to stop if any suitable zone becomes balanced to reduce the number
of pages it reclaims from other zones.  kswapd still tries to ensure that
order-0 watermarks for all zones are met before sleeping.
Signed-off-by: Mel Gorman <mel@csn.ul.ie>
Reviewed-by: Minchan Kim <minchan.kim@gmail.com>
Reviewed-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Reviewed-by: Eric B Munson <emunson@mgebm.net>
Cc: Simon Kirby <sim@hostway.ca>
Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Cc: Shaohua Li <shaohua.li@intel.com>
Cc: Dave Hansen <dave@linux.vnet.ibm.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Rik van Riel <riel@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
											
										 
											2011-01-13 15:46:20 -08:00
										 |  |  | void wakeup_kswapd(struct zone *zone, int order, enum zone_type classzone_idx) | 
					
						
							| 
									
										
										
										
											2005-04-16 15:20:36 -07:00
										 |  |  | { | 
					
						
							|  |  |  | 	pg_data_t *pgdat; | 
					
						
							|  |  |  | 
 | 
					
						
							| 
									
										
										
										
											2006-01-06 00:11:15 -08:00
										 |  |  | 	if (!populated_zone(zone)) | 
					
						
							| 
									
										
										
										
											2005-04-16 15:20:36 -07:00
										 |  |  | 		return; | 
					
						
							|  |  |  | 
 | 
					
						
							| 
									
										
											  
											
												mm: page allocator: adjust the per-cpu counter threshold when memory is low
Commit aa45484 ("calculate a better estimate of NR_FREE_PAGES when memory
is low") noted that watermarks were based on the vmstat NR_FREE_PAGES.  To
avoid synchronization overhead, these counters are maintained on a per-cpu
basis and drained both periodically and when a threshold is above a
threshold.  On large CPU systems, the difference between the estimate and
real value of NR_FREE_PAGES can be very high.  The system can get into a
case where pages are allocated far below the min watermark potentially
causing livelock issues.  The commit solved the problem by taking a better
reading of NR_FREE_PAGES when memory was low.
Unfortately, as reported by Shaohua Li this accurate reading can consume a
large amount of CPU time on systems with many sockets due to cache line
bouncing.  This patch takes a different approach.  For large machines
where counter drift might be unsafe and while kswapd is awake, the per-cpu
thresholds for the target pgdat are reduced to limit the level of drift to
what should be a safe level.  This incurs a performance penalty in heavy
memory pressure by a factor that depends on the workload and the machine
but the machine should function correctly without accidentally exhausting
all memory on a node.  There is an additional cost when kswapd wakes and
sleeps but the event is not expected to be frequent - in Shaohua's test
case, there was one recorded sleep and wake event at least.
To ensure that kswapd wakes up, a safe version of zone_watermark_ok() is
introduced that takes a more accurate reading of NR_FREE_PAGES when called
from wakeup_kswapd, when deciding whether it is really safe to go back to
sleep in sleeping_prematurely() and when deciding if a zone is really
balanced or not in balance_pgdat().  We are still using an expensive
function but limiting how often it is called.
When the test case is reproduced, the time spent in the watermark
functions is reduced.  The following report is on the percentage of time
spent cumulatively spent in the functions zone_nr_free_pages(),
zone_watermark_ok(), __zone_watermark_ok(), zone_watermark_ok_safe(),
zone_page_state_snapshot(), zone_page_state().
vanilla                      11.6615%
disable-threshold            0.2584%
David said:
: We had to pull aa454840 "mm: page allocator: calculate a better estimate
: of NR_FREE_PAGES when memory is low and kswapd is awake" from 2.6.36
: internally because tests showed that it would cause the machine to stall
: as the result of heavy kswapd activity.  I merged it back with this fix as
: it is pending in the -mm tree and it solves the issue we were seeing, so I
: definitely think this should be pushed to -stable (and I would seriously
: consider it for 2.6.37 inclusion even at this late date).
Signed-off-by: Mel Gorman <mel@csn.ul.ie>
Reported-by: Shaohua Li <shaohua.li@intel.com>
Reviewed-by: Christoph Lameter <cl@linux.com>
Tested-by: Nicolas Bareil <nico@chdir.org>
Cc: David Rientjes <rientjes@google.com>
Cc: Kyle McMartin <kyle@mcmartin.ca>
Cc: <stable@kernel.org>		[2.6.37.1, 2.6.36.x]
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
											
										 
											2011-01-13 15:45:41 -08:00
										 |  |  | 	if (!cpuset_zone_allowed_hardwall(zone, GFP_KERNEL)) | 
					
						
							| 
									
										
										
										
											2005-04-16 15:20:36 -07:00
										 |  |  | 		return; | 
					
						
							| 
									
										
											  
											
												mm: page allocator: adjust the per-cpu counter threshold when memory is low
Commit aa45484 ("calculate a better estimate of NR_FREE_PAGES when memory
is low") noted that watermarks were based on the vmstat NR_FREE_PAGES.  To
avoid synchronization overhead, these counters are maintained on a per-cpu
basis and drained both periodically and when a threshold is above a
threshold.  On large CPU systems, the difference between the estimate and
real value of NR_FREE_PAGES can be very high.  The system can get into a
case where pages are allocated far below the min watermark potentially
causing livelock issues.  The commit solved the problem by taking a better
reading of NR_FREE_PAGES when memory was low.
Unfortately, as reported by Shaohua Li this accurate reading can consume a
large amount of CPU time on systems with many sockets due to cache line
bouncing.  This patch takes a different approach.  For large machines
where counter drift might be unsafe and while kswapd is awake, the per-cpu
thresholds for the target pgdat are reduced to limit the level of drift to
what should be a safe level.  This incurs a performance penalty in heavy
memory pressure by a factor that depends on the workload and the machine
but the machine should function correctly without accidentally exhausting
all memory on a node.  There is an additional cost when kswapd wakes and
sleeps but the event is not expected to be frequent - in Shaohua's test
case, there was one recorded sleep and wake event at least.
To ensure that kswapd wakes up, a safe version of zone_watermark_ok() is
introduced that takes a more accurate reading of NR_FREE_PAGES when called
from wakeup_kswapd, when deciding whether it is really safe to go back to
sleep in sleeping_prematurely() and when deciding if a zone is really
balanced or not in balance_pgdat().  We are still using an expensive
function but limiting how often it is called.
When the test case is reproduced, the time spent in the watermark
functions is reduced.  The following report is on the percentage of time
spent cumulatively spent in the functions zone_nr_free_pages(),
zone_watermark_ok(), __zone_watermark_ok(), zone_watermark_ok_safe(),
zone_page_state_snapshot(), zone_page_state().
vanilla                      11.6615%
disable-threshold            0.2584%
David said:
: We had to pull aa454840 "mm: page allocator: calculate a better estimate
: of NR_FREE_PAGES when memory is low and kswapd is awake" from 2.6.36
: internally because tests showed that it would cause the machine to stall
: as the result of heavy kswapd activity.  I merged it back with this fix as
: it is pending in the -mm tree and it solves the issue we were seeing, so I
: definitely think this should be pushed to -stable (and I would seriously
: consider it for 2.6.37 inclusion even at this late date).
Signed-off-by: Mel Gorman <mel@csn.ul.ie>
Reported-by: Shaohua Li <shaohua.li@intel.com>
Reviewed-by: Christoph Lameter <cl@linux.com>
Tested-by: Nicolas Bareil <nico@chdir.org>
Cc: David Rientjes <rientjes@google.com>
Cc: Kyle McMartin <kyle@mcmartin.ca>
Cc: <stable@kernel.org>		[2.6.37.1, 2.6.36.x]
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
											
										 
											2011-01-13 15:45:41 -08:00
										 |  |  | 	pgdat = zone->zone_pgdat; | 
					
						
							| 
									
										
											  
											
												mm: kswapd: stop high-order balancing when any suitable zone is balanced
Simon Kirby reported the following problem
   We're seeing cases on a number of servers where cache never fully
   grows to use all available memory.  Sometimes we see servers with 4 GB
   of memory that never seem to have less than 1.5 GB free, even with a
   constantly-active VM.  In some cases, these servers also swap out while
   this happens, even though they are constantly reading the working set
   into memory.  We have been seeing this happening for a long time; I
   don't think it's anything recent, and it still happens on 2.6.36.
After some debugging work by Simon, Dave Hansen and others, the prevaling
theory became that kswapd is reclaiming order-3 pages requested by SLUB
too aggressive about it.
There are two apparent problems here.  On the target machine, there is a
small Normal zone in comparison to DMA32.  As kswapd tries to balance all
zones, it would continually try reclaiming for Normal even though DMA32
was balanced enough for callers.  The second problem is that
sleeping_prematurely() does not use the same logic as balance_pgdat() when
deciding whether to sleep or not.  This keeps kswapd artifically awake.
A number of tests were run and the figures from previous postings will
look very different for a few reasons.  One, the old figures were forcing
my network card to use GFP_ATOMIC in attempt to replicate Simon's problem.
 Second, I previous specified slub_min_order=3 again in an attempt to
reproduce Simon's problem.  In this posting, I'm depending on Simon to say
whether his problem is fixed or not and these figures are to show the
impact to the ordinary cases.  Finally, the "vmscan" figures are taken
from /proc/vmstat instead of the tracepoints.  There is less information
but recording is less disruptive.
The first test of relevance was postmark with a process running in the
background reading a large amount of anonymous memory in blocks.  The
objective was to vaguely simulate what was happening on Simon's machine
and it's memory intensive enough to have kswapd awake.
POSTMARK
                                            traceonly          kanyzone
Transactions per second:              156.00 ( 0.00%)   153.00 (-1.96%)
Data megabytes read per second:        21.51 ( 0.00%)    21.52 ( 0.05%)
Data megabytes written per second:     29.28 ( 0.00%)    29.11 (-0.58%)
Files created alone per second:       250.00 ( 0.00%)   416.00 (39.90%)
Files create/transact per second:      79.00 ( 0.00%)    76.00 (-3.95%)
Files deleted alone per second:       520.00 ( 0.00%)   420.00 (-23.81%)
Files delete/transact per second:      79.00 ( 0.00%)    76.00 (-3.95%)
MMTests Statistics: duration
User/Sys Time Running Test (seconds)         16.58      17.4
Total Elapsed Time (seconds)                218.48    222.47
VMstat Reclaim Statistics: vmscan
Direct reclaims                                  0          4
Direct reclaim pages scanned                     0        203
Direct reclaim pages reclaimed                   0        184
Kswapd pages scanned                        326631     322018
Kswapd pages reclaimed                      312632     309784
Kswapd low wmark quickly                         1          4
Kswapd high wmark quickly                      122        475
Kswapd skip congestion_wait                      1          0
Pages activated                             700040     705317
Pages deactivated                           212113     203922
Pages written                                 9875       6363
Total pages scanned                         326631    322221
Total pages reclaimed                       312632    309968
%age total pages scanned/reclaimed          95.71%    96.20%
%age total pages scanned/written             3.02%     1.97%
proc vmstat: Faults
Major Faults                                   300       254
Minor Faults                                645183    660284
Page ins                                    493588    486704
Page outs                                  4960088   4986704
Swap ins                                      1230       661
Swap outs                                     9869      6355
Performance is mildly affected because kswapd is no longer doing as much
work and the background memory consumer process is getting in the way.
Note that kswapd scanned and reclaimed fewer pages as it's less aggressive
and overall fewer pages were scanned and reclaimed.  Swap in/out is
particularly reduced again reflecting kswapd throwing out fewer pages.
The slight performance impact is unfortunate here but it looks like a
direct result of kswapd being less aggressive.  As the bug report is about
too many pages being freed by kswapd, it may have to be accepted for now.
The second test is a streaming IO benchmark that was previously used by
Johannes to show regressions in page reclaim.
MICRO
					 traceonly  kanyzone
User/Sys Time Running Test (seconds)         29.29     28.87
Total Elapsed Time (seconds)                492.18    488.79
VMstat Reclaim Statistics: vmscan
Direct reclaims                               2128       1460
Direct reclaim pages scanned               2284822    1496067
Direct reclaim pages reclaimed              148919     110937
Kswapd pages scanned                      15450014   16202876
Kswapd pages reclaimed                     8503697    8537897
Kswapd low wmark quickly                      3100       3397
Kswapd high wmark quickly                     1860       7243
Kswapd skip congestion_wait                    708        801
Pages activated                               9635       9573
Pages deactivated                             1432       1271
Pages written                                  223       1130
Total pages scanned                       17734836  17698943
Total pages reclaimed                      8652616   8648834
%age total pages scanned/reclaimed          48.79%    48.87%
%age total pages scanned/written             0.00%     0.01%
proc vmstat: Faults
Major Faults                                   165       221
Minor Faults                               9655785   9656506
Page ins                                      3880      7228
Page outs                                 37692940  37480076
Swap ins                                         0        69
Swap outs                                       19        15
Again fewer pages are scanned and reclaimed as expected and this time the
test completed faster.  Note that kswapd is hitting its watermarks faster
(low and high wmark quickly) which I expect is due to kswapd reclaiming
fewer pages.
I also ran fs-mark, iozone and sysbench but there is nothing interesting
to report in the figures.  Performance is not significantly changed and
the reclaim statistics look reasonable.
Tgis patch:
When the allocator enters its slow path, kswapd is woken up to balance the
node.  It continues working until all zones within the node are balanced.
For order-0 allocations, this makes perfect sense but for higher orders it
can have unintended side-effects.  If the zone sizes are imbalanced,
kswapd may reclaim heavily within a smaller zone discarding an excessive
number of pages.  The user-visible behaviour is that kswapd is awake and
reclaiming even though plenty of pages are free from a suitable zone.
This patch alters the "balance" logic for high-order reclaim allowing
kswapd to stop if any suitable zone becomes balanced to reduce the number
of pages it reclaims from other zones.  kswapd still tries to ensure that
order-0 watermarks for all zones are met before sleeping.
Signed-off-by: Mel Gorman <mel@csn.ul.ie>
Reviewed-by: Minchan Kim <minchan.kim@gmail.com>
Reviewed-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Reviewed-by: Eric B Munson <emunson@mgebm.net>
Cc: Simon Kirby <sim@hostway.ca>
Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Cc: Shaohua Li <shaohua.li@intel.com>
Cc: Dave Hansen <dave@linux.vnet.ibm.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Rik van Riel <riel@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
											
										 
											2011-01-13 15:46:20 -08:00
										 |  |  | 	if (pgdat->kswapd_max_order < order) { | 
					
						
							| 
									
										
										
										
											2005-04-16 15:20:36 -07:00
										 |  |  | 		pgdat->kswapd_max_order = order; | 
					
						
							| 
									
										
											  
											
												mm: kswapd: stop high-order balancing when any suitable zone is balanced
Simon Kirby reported the following problem
   We're seeing cases on a number of servers where cache never fully
   grows to use all available memory.  Sometimes we see servers with 4 GB
   of memory that never seem to have less than 1.5 GB free, even with a
   constantly-active VM.  In some cases, these servers also swap out while
   this happens, even though they are constantly reading the working set
   into memory.  We have been seeing this happening for a long time; I
   don't think it's anything recent, and it still happens on 2.6.36.
After some debugging work by Simon, Dave Hansen and others, the prevaling
theory became that kswapd is reclaiming order-3 pages requested by SLUB
too aggressive about it.
There are two apparent problems here.  On the target machine, there is a
small Normal zone in comparison to DMA32.  As kswapd tries to balance all
zones, it would continually try reclaiming for Normal even though DMA32
was balanced enough for callers.  The second problem is that
sleeping_prematurely() does not use the same logic as balance_pgdat() when
deciding whether to sleep or not.  This keeps kswapd artifically awake.
A number of tests were run and the figures from previous postings will
look very different for a few reasons.  One, the old figures were forcing
my network card to use GFP_ATOMIC in attempt to replicate Simon's problem.
 Second, I previous specified slub_min_order=3 again in an attempt to
reproduce Simon's problem.  In this posting, I'm depending on Simon to say
whether his problem is fixed or not and these figures are to show the
impact to the ordinary cases.  Finally, the "vmscan" figures are taken
from /proc/vmstat instead of the tracepoints.  There is less information
but recording is less disruptive.
The first test of relevance was postmark with a process running in the
background reading a large amount of anonymous memory in blocks.  The
objective was to vaguely simulate what was happening on Simon's machine
and it's memory intensive enough to have kswapd awake.
POSTMARK
                                            traceonly          kanyzone
Transactions per second:              156.00 ( 0.00%)   153.00 (-1.96%)
Data megabytes read per second:        21.51 ( 0.00%)    21.52 ( 0.05%)
Data megabytes written per second:     29.28 ( 0.00%)    29.11 (-0.58%)
Files created alone per second:       250.00 ( 0.00%)   416.00 (39.90%)
Files create/transact per second:      79.00 ( 0.00%)    76.00 (-3.95%)
Files deleted alone per second:       520.00 ( 0.00%)   420.00 (-23.81%)
Files delete/transact per second:      79.00 ( 0.00%)    76.00 (-3.95%)
MMTests Statistics: duration
User/Sys Time Running Test (seconds)         16.58      17.4
Total Elapsed Time (seconds)                218.48    222.47
VMstat Reclaim Statistics: vmscan
Direct reclaims                                  0          4
Direct reclaim pages scanned                     0        203
Direct reclaim pages reclaimed                   0        184
Kswapd pages scanned                        326631     322018
Kswapd pages reclaimed                      312632     309784
Kswapd low wmark quickly                         1          4
Kswapd high wmark quickly                      122        475
Kswapd skip congestion_wait                      1          0
Pages activated                             700040     705317
Pages deactivated                           212113     203922
Pages written                                 9875       6363
Total pages scanned                         326631    322221
Total pages reclaimed                       312632    309968
%age total pages scanned/reclaimed          95.71%    96.20%
%age total pages scanned/written             3.02%     1.97%
proc vmstat: Faults
Major Faults                                   300       254
Minor Faults                                645183    660284
Page ins                                    493588    486704
Page outs                                  4960088   4986704
Swap ins                                      1230       661
Swap outs                                     9869      6355
Performance is mildly affected because kswapd is no longer doing as much
work and the background memory consumer process is getting in the way.
Note that kswapd scanned and reclaimed fewer pages as it's less aggressive
and overall fewer pages were scanned and reclaimed.  Swap in/out is
particularly reduced again reflecting kswapd throwing out fewer pages.
The slight performance impact is unfortunate here but it looks like a
direct result of kswapd being less aggressive.  As the bug report is about
too many pages being freed by kswapd, it may have to be accepted for now.
The second test is a streaming IO benchmark that was previously used by
Johannes to show regressions in page reclaim.
MICRO
					 traceonly  kanyzone
User/Sys Time Running Test (seconds)         29.29     28.87
Total Elapsed Time (seconds)                492.18    488.79
VMstat Reclaim Statistics: vmscan
Direct reclaims                               2128       1460
Direct reclaim pages scanned               2284822    1496067
Direct reclaim pages reclaimed              148919     110937
Kswapd pages scanned                      15450014   16202876
Kswapd pages reclaimed                     8503697    8537897
Kswapd low wmark quickly                      3100       3397
Kswapd high wmark quickly                     1860       7243
Kswapd skip congestion_wait                    708        801
Pages activated                               9635       9573
Pages deactivated                             1432       1271
Pages written                                  223       1130
Total pages scanned                       17734836  17698943
Total pages reclaimed                      8652616   8648834
%age total pages scanned/reclaimed          48.79%    48.87%
%age total pages scanned/written             0.00%     0.01%
proc vmstat: Faults
Major Faults                                   165       221
Minor Faults                               9655785   9656506
Page ins                                      3880      7228
Page outs                                 37692940  37480076
Swap ins                                         0        69
Swap outs                                       19        15
Again fewer pages are scanned and reclaimed as expected and this time the
test completed faster.  Note that kswapd is hitting its watermarks faster
(low and high wmark quickly) which I expect is due to kswapd reclaiming
fewer pages.
I also ran fs-mark, iozone and sysbench but there is nothing interesting
to report in the figures.  Performance is not significantly changed and
the reclaim statistics look reasonable.
Tgis patch:
When the allocator enters its slow path, kswapd is woken up to balance the
node.  It continues working until all zones within the node are balanced.
For order-0 allocations, this makes perfect sense but for higher orders it
can have unintended side-effects.  If the zone sizes are imbalanced,
kswapd may reclaim heavily within a smaller zone discarding an excessive
number of pages.  The user-visible behaviour is that kswapd is awake and
reclaiming even though plenty of pages are free from a suitable zone.
This patch alters the "balance" logic for high-order reclaim allowing
kswapd to stop if any suitable zone becomes balanced to reduce the number
of pages it reclaims from other zones.  kswapd still tries to ensure that
order-0 watermarks for all zones are met before sleeping.
Signed-off-by: Mel Gorman <mel@csn.ul.ie>
Reviewed-by: Minchan Kim <minchan.kim@gmail.com>
Reviewed-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Reviewed-by: Eric B Munson <emunson@mgebm.net>
Cc: Simon Kirby <sim@hostway.ca>
Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Cc: Shaohua Li <shaohua.li@intel.com>
Cc: Dave Hansen <dave@linux.vnet.ibm.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Rik van Riel <riel@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
											
										 
											2011-01-13 15:46:20 -08:00
										 |  |  | 		pgdat->classzone_idx = min(pgdat->classzone_idx, classzone_idx); | 
					
						
							|  |  |  | 	} | 
					
						
							| 
									
										
										
										
											2005-09-13 01:25:07 -07:00
										 |  |  | 	if (!waitqueue_active(&pgdat->kswapd_wait)) | 
					
						
							| 
									
										
										
										
											2005-04-16 15:20:36 -07:00
										 |  |  | 		return; | 
					
						
							| 
									
										
											  
											
												mm: vmscan: fix numa reclaim balance problem in kswapd
The way the page allocator interacts with kswapd creates aging imbalances,
where the amount of time a userspace page gets in memory under reclaim
pressure is dependent on which zone, which node the allocator took the
page frame from.
#1 fixes missed kswapd wakeups on NUMA systems, which lead to some
   nodes falling behind for a full reclaim cycle relative to the other
   nodes in the system
#3 fixes an interaction where kswapd and a continuous stream of page
   allocations keep the preferred zone of a task between the high and
   low watermark (allocations succeed + kswapd does not go to sleep)
   indefinitely, completely underutilizing the lower zones and
   thrashing on the preferred zone
These patches are the aging fairness part of the thrash-detection based
file LRU balancing.  Andrea recommended to submit them separately as they
are bugfixes in their own right.
The following test ran a foreground workload (memcachetest) with
background IO of various sizes on a 4 node 8G system (similar results were
observed with single-node 4G systems):
parallelio
                                               BAS                    FAIRALLO
                                              BASE                   FAIRALLOC
Ops memcachetest-0M              5170.00 (  0.00%)           5283.00 (  2.19%)
Ops memcachetest-791M            4740.00 (  0.00%)           5293.00 ( 11.67%)
Ops memcachetest-2639M           2551.00 (  0.00%)           4950.00 ( 94.04%)
Ops memcachetest-4487M           2606.00 (  0.00%)           3922.00 ( 50.50%)
Ops io-duration-0M                  0.00 (  0.00%)              0.00 (  0.00%)
Ops io-duration-791M               55.00 (  0.00%)             18.00 ( 67.27%)
Ops io-duration-2639M             235.00 (  0.00%)            103.00 ( 56.17%)
Ops io-duration-4487M             278.00 (  0.00%)            173.00 ( 37.77%)
Ops swaptotal-0M                    0.00 (  0.00%)              0.00 (  0.00%)
Ops swaptotal-791M             245184.00 (  0.00%)              0.00 (  0.00%)
Ops swaptotal-2639M            468069.00 (  0.00%)         108778.00 ( 76.76%)
Ops swaptotal-4487M            452529.00 (  0.00%)          76623.00 ( 83.07%)
Ops swapin-0M                       0.00 (  0.00%)              0.00 (  0.00%)
Ops swapin-791M                108297.00 (  0.00%)              0.00 (  0.00%)
Ops swapin-2639M               169537.00 (  0.00%)          50031.00 ( 70.49%)
Ops swapin-4487M               167435.00 (  0.00%)          34178.00 ( 79.59%)
Ops minorfaults-0M            1518666.00 (  0.00%)        1503993.00 (  0.97%)
Ops minorfaults-791M          1676963.00 (  0.00%)        1520115.00 (  9.35%)
Ops minorfaults-2639M         1606035.00 (  0.00%)        1799717.00 (-12.06%)
Ops minorfaults-4487M         1612118.00 (  0.00%)        1583825.00 (  1.76%)
Ops majorfaults-0M                  6.00 (  0.00%)              0.00 (  0.00%)
Ops majorfaults-791M            13836.00 (  0.00%)             10.00 ( 99.93%)
Ops majorfaults-2639M           22307.00 (  0.00%)           6490.00 ( 70.91%)
Ops majorfaults-4487M           21631.00 (  0.00%)           4380.00 ( 79.75%)
                 BAS    FAIRALLO
                BASE   FAIRALLOC
User          287.78      460.97
System       2151.67     3142.51
Elapsed      9737.00     8879.34
                                   BAS    FAIRALLO
                                  BASE   FAIRALLOC
Minor Faults                  53721925    57188551
Major Faults                    392195       15157
Swap Ins                       2994854      112770
Swap Outs                      4907092      134982
Direct pages scanned                 0       41824
Kswapd pages scanned          32975063     8128269
Kswapd pages reclaimed         6323069     7093495
Direct pages reclaimed               0       41824
Kswapd efficiency                  19%         87%
Kswapd velocity               3386.573     915.414
Direct efficiency                 100%        100%
Direct velocity                  0.000       4.710
Percentage direct scans             0%          0%
Zone normal velocity          2011.338     550.661
Zone dma32 velocity           1365.623     369.221
Zone dma velocity                9.612       0.242
Page writes by reclaim    18732404.000  614807.000
Page writes file              13825312      479825
Page writes anon               4907092      134982
Page reclaim immediate           85490        5647
Sector Reads                  12080532      483244
Sector Writes                 88740508    65438876
Page rescued immediate               0           0
Slabs scanned                    82560       12160
Direct inode steals                  0           0
Kswapd inode steals              24401       40013
Kswapd skipped wait                  0           0
THP fault alloc                      6           8
THP collapse alloc                5481        5812
THP splits                          75          22
THP fault fallback                   0           0
THP collapse fail                    0           0
Compaction stalls                    0          54
Compaction success                   0          45
Compaction failures                  0           9
Page migrate success            881492       82278
Page migrate failure                 0           0
Compaction pages isolated            0       60334
Compaction migrate scanned           0       53505
Compaction free scanned              0     1537605
Compaction cost                    914          86
NUMA PTE updates              46738231    41988419
NUMA hint faults              31175564    24213387
NUMA hint local faults        10427393     6411593
NUMA pages migrated             881492       55344
AutoNUMA cost                   156221      121361
The overall runtime was reduced, throughput for both the foreground
workload as well as the background IO improved, major faults, swapping and
reclaim activity shrunk significantly, reclaim efficiency more than
quadrupled.
This patch:
When the page allocator fails to get a page from all zones in its given
zonelist, it wakes up the per-node kswapds for all zones that are at their
low watermark.
However, with a system under load the free pages in a zone can fluctuate
enough that the allocation fails but the kswapd wakeup is also skipped
while the zone is still really close to the low watermark.
When one node misses a wakeup like this, it won't be aged before all the
other node's zones are down to their low watermarks again.  And skipping a
full aging cycle is an obvious fairness problem.
Kswapd runs until the high watermarks are restored, so it should also be
woken when the high watermarks are not met.  This ages nodes more equally
and creates a safety margin for the page counter fluctuation.
By using zone_balanced(), it will now check, in addition to the watermark,
if compaction requires more order-0 pages to create a higher order page.
Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
Cc: Mel Gorman <mgorman@suse.de>
Reviewed-by: Rik van Riel <riel@redhat.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Paul Bolle <paul.bollee@gmail.com>
Tested-by: Zlatko Calusic <zcalusic@bitsync.net>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
											
										 
											2013-09-11 14:20:39 -07:00
										 |  |  | 	if (zone_balanced(zone, order, 0, 0)) | 
					
						
							| 
									
										
											  
											
												mm: page allocator: adjust the per-cpu counter threshold when memory is low
Commit aa45484 ("calculate a better estimate of NR_FREE_PAGES when memory
is low") noted that watermarks were based on the vmstat NR_FREE_PAGES.  To
avoid synchronization overhead, these counters are maintained on a per-cpu
basis and drained both periodically and when a threshold is above a
threshold.  On large CPU systems, the difference between the estimate and
real value of NR_FREE_PAGES can be very high.  The system can get into a
case where pages are allocated far below the min watermark potentially
causing livelock issues.  The commit solved the problem by taking a better
reading of NR_FREE_PAGES when memory was low.
Unfortately, as reported by Shaohua Li this accurate reading can consume a
large amount of CPU time on systems with many sockets due to cache line
bouncing.  This patch takes a different approach.  For large machines
where counter drift might be unsafe and while kswapd is awake, the per-cpu
thresholds for the target pgdat are reduced to limit the level of drift to
what should be a safe level.  This incurs a performance penalty in heavy
memory pressure by a factor that depends on the workload and the machine
but the machine should function correctly without accidentally exhausting
all memory on a node.  There is an additional cost when kswapd wakes and
sleeps but the event is not expected to be frequent - in Shaohua's test
case, there was one recorded sleep and wake event at least.
To ensure that kswapd wakes up, a safe version of zone_watermark_ok() is
introduced that takes a more accurate reading of NR_FREE_PAGES when called
from wakeup_kswapd, when deciding whether it is really safe to go back to
sleep in sleeping_prematurely() and when deciding if a zone is really
balanced or not in balance_pgdat().  We are still using an expensive
function but limiting how often it is called.
When the test case is reproduced, the time spent in the watermark
functions is reduced.  The following report is on the percentage of time
spent cumulatively spent in the functions zone_nr_free_pages(),
zone_watermark_ok(), __zone_watermark_ok(), zone_watermark_ok_safe(),
zone_page_state_snapshot(), zone_page_state().
vanilla                      11.6615%
disable-threshold            0.2584%
David said:
: We had to pull aa454840 "mm: page allocator: calculate a better estimate
: of NR_FREE_PAGES when memory is low and kswapd is awake" from 2.6.36
: internally because tests showed that it would cause the machine to stall
: as the result of heavy kswapd activity.  I merged it back with this fix as
: it is pending in the -mm tree and it solves the issue we were seeing, so I
: definitely think this should be pushed to -stable (and I would seriously
: consider it for 2.6.37 inclusion even at this late date).
Signed-off-by: Mel Gorman <mel@csn.ul.ie>
Reported-by: Shaohua Li <shaohua.li@intel.com>
Reviewed-by: Christoph Lameter <cl@linux.com>
Tested-by: Nicolas Bareil <nico@chdir.org>
Cc: David Rientjes <rientjes@google.com>
Cc: Kyle McMartin <kyle@mcmartin.ca>
Cc: <stable@kernel.org>		[2.6.37.1, 2.6.36.x]
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
											
										 
											2011-01-13 15:45:41 -08:00
										 |  |  | 		return; | 
					
						
							|  |  |  | 
 | 
					
						
							|  |  |  | 	trace_mm_vmscan_wakeup_kswapd(pgdat->node_id, zone_idx(zone), order); | 
					
						
							| 
									
										
										
										
											2005-09-13 01:25:07 -07:00
										 |  |  | 	wake_up_interruptible(&pgdat->kswapd_wait); | 
					
						
							| 
									
										
										
										
											2005-04-16 15:20:36 -07:00
										 |  |  | } | 
					
						
							|  |  |  | 
 | 
					
						
							| 
									
										
										
										
											2009-09-21 17:01:42 -07:00
										 |  |  | /*
 | 
					
						
							|  |  |  |  * The reclaimable count would be mostly accurate. | 
					
						
							|  |  |  |  * The less reclaimable pages may be | 
					
						
							|  |  |  |  * - mlocked pages, which will be moved to unevictable list when encountered | 
					
						
							|  |  |  |  * - mapped pages, which may require several travels to be reclaimed | 
					
						
							|  |  |  |  * - dirty pages, which is not "instantly" reclaimable | 
					
						
							|  |  |  |  */ | 
					
						
							|  |  |  | unsigned long global_reclaimable_pages(void) | 
					
						
							| 
									
										
										
										
											2008-10-18 20:26:32 -07:00
										 |  |  | { | 
					
						
							| 
									
										
										
										
											2009-09-21 17:01:42 -07:00
										 |  |  | 	int nr; | 
					
						
							|  |  |  | 
 | 
					
						
							|  |  |  | 	nr = global_page_state(NR_ACTIVE_FILE) + | 
					
						
							|  |  |  | 	     global_page_state(NR_INACTIVE_FILE); | 
					
						
							|  |  |  | 
 | 
					
						
							| 
									
										
											  
											
												swap: add per-partition lock for swapfile
swap_lock is heavily contended when I test swap to 3 fast SSD (even
slightly slower than swap to 2 such SSD).  The main contention comes
from swap_info_get().  This patch tries to fix the gap with adding a new
per-partition lock.
Global data like nr_swapfiles, total_swap_pages, least_priority and
swap_list are still protected by swap_lock.
nr_swap_pages is an atomic now, it can be changed without swap_lock.  In
theory, it's possible get_swap_page() finds no swap pages but actually
there are free swap pages.  But sounds not a big problem.
Accessing partition specific data (like scan_swap_map and so on) is only
protected by swap_info_struct.lock.
Changing swap_info_struct.flags need hold swap_lock and
swap_info_struct.lock, because scan_scan_map() will check it.  read the
flags is ok with either the locks hold.
If both swap_lock and swap_info_struct.lock must be hold, we always hold
the former first to avoid deadlock.
swap_entry_free() can change swap_list.  To delete that code, we add a
new highest_priority_index.  Whenever get_swap_page() is called, we
check it.  If it's valid, we use it.
It's a pity get_swap_page() still holds swap_lock().  But in practice,
swap_lock() isn't heavily contended in my test with this patch (or I can
say there are other much more heavier bottlenecks like TLB flush).  And
BTW, looks get_swap_page() doesn't really need the lock.  We never free
swap_info[] and we check SWAP_WRITEOK flag.  The only risk without the
lock is we could swapout to some low priority swap, but we can quickly
recover after several rounds of swap, so sounds not a big deal to me.
But I'd prefer to fix this if it's a real problem.
"swap: make each swap partition have one address_space" improved the
swapout speed from 1.7G/s to 2G/s.  This patch further improves the
speed to 2.3G/s, so around 15% improvement.  It's a multi-process test,
so TLB flush isn't the biggest bottleneck before the patches.
[arnd@arndb.de: fix it for nommu]
[hughd@google.com: add missing unlock]
[minchan@kernel.org: get rid of lockdep whinge on sys_swapon]
Signed-off-by: Shaohua Li <shli@fusionio.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: Rik van Riel <riel@redhat.com>
Cc: Minchan Kim <minchan.kim@gmail.com>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: Seth Jennings <sjenning@linux.vnet.ibm.com>
Cc: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Cc: Xiao Guangrong <xiaoguangrong@linux.vnet.ibm.com>
Cc: Dan Magenheimer <dan.magenheimer@oracle.com>
Cc: Stephen Rothwell <sfr@canb.auug.org.au>
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Signed-off-by: Hugh Dickins <hughd@google.com>
Signed-off-by: Minchan Kim <minchan@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
											
										 
											2013-02-22 16:34:38 -08:00
										 |  |  | 	if (get_nr_swap_pages() > 0) | 
					
						
							| 
									
										
										
										
											2009-09-21 17:01:42 -07:00
										 |  |  | 		nr += global_page_state(NR_ACTIVE_ANON) + | 
					
						
							|  |  |  | 		      global_page_state(NR_INACTIVE_ANON); | 
					
						
							|  |  |  | 
 | 
					
						
							|  |  |  | 	return nr; | 
					
						
							|  |  |  | } | 
					
						
							|  |  |  | 
 | 
					
						
							| 
									
										
										
										
											2009-05-24 22:16:31 +02:00
										 |  |  | #ifdef CONFIG_HIBERNATION
 | 
					
						
							| 
									
										
										
										
											2005-04-16 15:20:36 -07:00
										 |  |  | /*
 | 
					
						
							| 
									
										
											  
											
												vmscan: kill hibernation specific reclaim logic and unify it
shrink_all_zone() was introduced by commit d6277db4ab (swsusp: rework
memory shrinker) for hibernate performance improvement.  and
sc.swap_cluster_max was introduced by commit a06fe4d307 (Speed freeing
memory for suspend).
commit a06fe4d307 said
   Without the patch:
   Freed  14600 pages in  1749 jiffies = 32.61 MB/s (Anomolous!)
   Freed  88563 pages in 14719 jiffies = 23.50 MB/s
   Freed 205734 pages in 32389 jiffies = 24.81 MB/s
   With the patch:
   Freed  68252 pages in   496 jiffies = 537.52 MB/s
   Freed 116464 pages in   569 jiffies = 798.54 MB/s
   Freed 209699 pages in   705 jiffies = 1161.89 MB/s
At that time, their patch was pretty worth.  However, Modern Hardware
trend and recent VM improvement broke its worth.  From several reason, I
think we should remove shrink_all_zones() at all.
detail:
1) Old days, shrink_zone()'s slowness was mainly caused by stupid io-throttle
  at no i/o congestion.
  but current shrink_zone() is sane, not slow.
2) shrink_all_zone() try to shrink all pages at a time. but it doesn't works
  fine on numa system.
  example)
    System has 4GB memory and each node have 2GB. and hibernate need 1GB.
    optimal)
       steal 500MB from each node.
    shrink_all_zones)
       steal 1GB from node-0.
  Oh, Cache balancing logic was broken. ;)
  Unfortunately, Desktop system moved ahead NUMA at nowadays.
  (Side note, if hibernate require 2GB, shrink_all_zones() never success
   on above machine)
3) if the node has several I/O flighting pages, shrink_all_zones() makes
  pretty bad result.
  schenario) hibernate need 1GB
  1) shrink_all_zones() try to reclaim 1GB from Node-0
  2) but it only reclaimed 990MB
  3) stupidly, shrink_all_zones() try to reclaim 1GB from Node-1
  4) it reclaimed 990MB
  Oh, well. it reclaimed twice much than required.
  In the other hand, current shrink_zone() has sane baling out logic.
  then, it doesn't make overkill reclaim. then, we lost shrink_zones()'s risk.
4) SplitLRU VM always keep active/inactive ratio very carefully. inactive list only
  shrinking break its assumption. it makes unnecessary OOM risk. it obviously suboptimal.
Now, shrink_all_memory() is only the wrapper function of do_try_to_free_pages().
it bring good reviewability and debuggability, and solve above problems.
side note: Reclaim logic unificication makes two good side effect.
 - Fix recursive reclaim bug on shrink_all_memory().
   it did forgot to use PF_MEMALLOC. it mean the system be able to stuck into deadlock.
 - Now, shrink_all_memory() got lockdep awareness. it bring good debuggability.
Signed-off-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Reviewed-by: Rik van Riel <riel@redhat.com>
Acked-by: Rafael J. Wysocki <rjw@sisk.pl>
Cc: Mel Gorman <mel@csn.ul.ie>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
											
										 
											2009-12-14 17:59:12 -08:00
										 |  |  |  * Try to free `nr_to_reclaim' of memory, system-wide, and return the number of | 
					
						
							| 
									
										
										
										
											2006-06-23 02:03:18 -07:00
										 |  |  |  * freed pages. | 
					
						
							|  |  |  |  * | 
					
						
							|  |  |  |  * Rather than trying to age LRUs the aim is to preserve the overall | 
					
						
							|  |  |  |  * LRU order by reclaiming preferentially | 
					
						
							|  |  |  |  * inactive > active > active referenced > active mapped | 
					
						
							| 
									
										
										
										
											2005-04-16 15:20:36 -07:00
										 |  |  |  */ | 
					
						
							| 
									
										
											  
											
												vmscan: kill hibernation specific reclaim logic and unify it
shrink_all_zone() was introduced by commit d6277db4ab (swsusp: rework
memory shrinker) for hibernate performance improvement.  and
sc.swap_cluster_max was introduced by commit a06fe4d307 (Speed freeing
memory for suspend).
commit a06fe4d307 said
   Without the patch:
   Freed  14600 pages in  1749 jiffies = 32.61 MB/s (Anomolous!)
   Freed  88563 pages in 14719 jiffies = 23.50 MB/s
   Freed 205734 pages in 32389 jiffies = 24.81 MB/s
   With the patch:
   Freed  68252 pages in   496 jiffies = 537.52 MB/s
   Freed 116464 pages in   569 jiffies = 798.54 MB/s
   Freed 209699 pages in   705 jiffies = 1161.89 MB/s
At that time, their patch was pretty worth.  However, Modern Hardware
trend and recent VM improvement broke its worth.  From several reason, I
think we should remove shrink_all_zones() at all.
detail:
1) Old days, shrink_zone()'s slowness was mainly caused by stupid io-throttle
  at no i/o congestion.
  but current shrink_zone() is sane, not slow.
2) shrink_all_zone() try to shrink all pages at a time. but it doesn't works
  fine on numa system.
  example)
    System has 4GB memory and each node have 2GB. and hibernate need 1GB.
    optimal)
       steal 500MB from each node.
    shrink_all_zones)
       steal 1GB from node-0.
  Oh, Cache balancing logic was broken. ;)
  Unfortunately, Desktop system moved ahead NUMA at nowadays.
  (Side note, if hibernate require 2GB, shrink_all_zones() never success
   on above machine)
3) if the node has several I/O flighting pages, shrink_all_zones() makes
  pretty bad result.
  schenario) hibernate need 1GB
  1) shrink_all_zones() try to reclaim 1GB from Node-0
  2) but it only reclaimed 990MB
  3) stupidly, shrink_all_zones() try to reclaim 1GB from Node-1
  4) it reclaimed 990MB
  Oh, well. it reclaimed twice much than required.
  In the other hand, current shrink_zone() has sane baling out logic.
  then, it doesn't make overkill reclaim. then, we lost shrink_zones()'s risk.
4) SplitLRU VM always keep active/inactive ratio very carefully. inactive list only
  shrinking break its assumption. it makes unnecessary OOM risk. it obviously suboptimal.
Now, shrink_all_memory() is only the wrapper function of do_try_to_free_pages().
it bring good reviewability and debuggability, and solve above problems.
side note: Reclaim logic unificication makes two good side effect.
 - Fix recursive reclaim bug on shrink_all_memory().
   it did forgot to use PF_MEMALLOC. it mean the system be able to stuck into deadlock.
 - Now, shrink_all_memory() got lockdep awareness. it bring good debuggability.
Signed-off-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Reviewed-by: Rik van Riel <riel@redhat.com>
Acked-by: Rafael J. Wysocki <rjw@sisk.pl>
Cc: Mel Gorman <mel@csn.ul.ie>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
											
										 
											2009-12-14 17:59:12 -08:00
										 |  |  | unsigned long shrink_all_memory(unsigned long nr_to_reclaim) | 
					
						
							| 
									
										
										
										
											2005-04-16 15:20:36 -07:00
										 |  |  | { | 
					
						
							| 
									
										
										
										
											2006-06-23 02:03:18 -07:00
										 |  |  | 	struct reclaim_state reclaim_state; | 
					
						
							|  |  |  | 	struct scan_control sc = { | 
					
						
							| 
									
										
											  
											
												vmscan: kill hibernation specific reclaim logic and unify it
shrink_all_zone() was introduced by commit d6277db4ab (swsusp: rework
memory shrinker) for hibernate performance improvement.  and
sc.swap_cluster_max was introduced by commit a06fe4d307 (Speed freeing
memory for suspend).
commit a06fe4d307 said
   Without the patch:
   Freed  14600 pages in  1749 jiffies = 32.61 MB/s (Anomolous!)
   Freed  88563 pages in 14719 jiffies = 23.50 MB/s
   Freed 205734 pages in 32389 jiffies = 24.81 MB/s
   With the patch:
   Freed  68252 pages in   496 jiffies = 537.52 MB/s
   Freed 116464 pages in   569 jiffies = 798.54 MB/s
   Freed 209699 pages in   705 jiffies = 1161.89 MB/s
At that time, their patch was pretty worth.  However, Modern Hardware
trend and recent VM improvement broke its worth.  From several reason, I
think we should remove shrink_all_zones() at all.
detail:
1) Old days, shrink_zone()'s slowness was mainly caused by stupid io-throttle
  at no i/o congestion.
  but current shrink_zone() is sane, not slow.
2) shrink_all_zone() try to shrink all pages at a time. but it doesn't works
  fine on numa system.
  example)
    System has 4GB memory and each node have 2GB. and hibernate need 1GB.
    optimal)
       steal 500MB from each node.
    shrink_all_zones)
       steal 1GB from node-0.
  Oh, Cache balancing logic was broken. ;)
  Unfortunately, Desktop system moved ahead NUMA at nowadays.
  (Side note, if hibernate require 2GB, shrink_all_zones() never success
   on above machine)
3) if the node has several I/O flighting pages, shrink_all_zones() makes
  pretty bad result.
  schenario) hibernate need 1GB
  1) shrink_all_zones() try to reclaim 1GB from Node-0
  2) but it only reclaimed 990MB
  3) stupidly, shrink_all_zones() try to reclaim 1GB from Node-1
  4) it reclaimed 990MB
  Oh, well. it reclaimed twice much than required.
  In the other hand, current shrink_zone() has sane baling out logic.
  then, it doesn't make overkill reclaim. then, we lost shrink_zones()'s risk.
4) SplitLRU VM always keep active/inactive ratio very carefully. inactive list only
  shrinking break its assumption. it makes unnecessary OOM risk. it obviously suboptimal.
Now, shrink_all_memory() is only the wrapper function of do_try_to_free_pages().
it bring good reviewability and debuggability, and solve above problems.
side note: Reclaim logic unificication makes two good side effect.
 - Fix recursive reclaim bug on shrink_all_memory().
   it did forgot to use PF_MEMALLOC. it mean the system be able to stuck into deadlock.
 - Now, shrink_all_memory() got lockdep awareness. it bring good debuggability.
Signed-off-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Reviewed-by: Rik van Riel <riel@redhat.com>
Acked-by: Rafael J. Wysocki <rjw@sisk.pl>
Cc: Mel Gorman <mel@csn.ul.ie>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
											
										 
											2009-12-14 17:59:12 -08:00
										 |  |  | 		.gfp_mask = GFP_HIGHUSER_MOVABLE, | 
					
						
							|  |  |  | 		.may_swap = 1, | 
					
						
							|  |  |  | 		.may_unmap = 1, | 
					
						
							| 
									
										
										
										
											2006-06-23 02:03:18 -07:00
										 |  |  | 		.may_writepage = 1, | 
					
						
							| 
									
										
											  
											
												vmscan: kill hibernation specific reclaim logic and unify it
shrink_all_zone() was introduced by commit d6277db4ab (swsusp: rework
memory shrinker) for hibernate performance improvement.  and
sc.swap_cluster_max was introduced by commit a06fe4d307 (Speed freeing
memory for suspend).
commit a06fe4d307 said
   Without the patch:
   Freed  14600 pages in  1749 jiffies = 32.61 MB/s (Anomolous!)
   Freed  88563 pages in 14719 jiffies = 23.50 MB/s
   Freed 205734 pages in 32389 jiffies = 24.81 MB/s
   With the patch:
   Freed  68252 pages in   496 jiffies = 537.52 MB/s
   Freed 116464 pages in   569 jiffies = 798.54 MB/s
   Freed 209699 pages in   705 jiffies = 1161.89 MB/s
At that time, their patch was pretty worth.  However, Modern Hardware
trend and recent VM improvement broke its worth.  From several reason, I
think we should remove shrink_all_zones() at all.
detail:
1) Old days, shrink_zone()'s slowness was mainly caused by stupid io-throttle
  at no i/o congestion.
  but current shrink_zone() is sane, not slow.
2) shrink_all_zone() try to shrink all pages at a time. but it doesn't works
  fine on numa system.
  example)
    System has 4GB memory and each node have 2GB. and hibernate need 1GB.
    optimal)
       steal 500MB from each node.
    shrink_all_zones)
       steal 1GB from node-0.
  Oh, Cache balancing logic was broken. ;)
  Unfortunately, Desktop system moved ahead NUMA at nowadays.
  (Side note, if hibernate require 2GB, shrink_all_zones() never success
   on above machine)
3) if the node has several I/O flighting pages, shrink_all_zones() makes
  pretty bad result.
  schenario) hibernate need 1GB
  1) shrink_all_zones() try to reclaim 1GB from Node-0
  2) but it only reclaimed 990MB
  3) stupidly, shrink_all_zones() try to reclaim 1GB from Node-1
  4) it reclaimed 990MB
  Oh, well. it reclaimed twice much than required.
  In the other hand, current shrink_zone() has sane baling out logic.
  then, it doesn't make overkill reclaim. then, we lost shrink_zones()'s risk.
4) SplitLRU VM always keep active/inactive ratio very carefully. inactive list only
  shrinking break its assumption. it makes unnecessary OOM risk. it obviously suboptimal.
Now, shrink_all_memory() is only the wrapper function of do_try_to_free_pages().
it bring good reviewability and debuggability, and solve above problems.
side note: Reclaim logic unificication makes two good side effect.
 - Fix recursive reclaim bug on shrink_all_memory().
   it did forgot to use PF_MEMALLOC. it mean the system be able to stuck into deadlock.
 - Now, shrink_all_memory() got lockdep awareness. it bring good debuggability.
Signed-off-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Reviewed-by: Rik van Riel <riel@redhat.com>
Acked-by: Rafael J. Wysocki <rjw@sisk.pl>
Cc: Mel Gorman <mel@csn.ul.ie>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
											
										 
											2009-12-14 17:59:12 -08:00
										 |  |  | 		.nr_to_reclaim = nr_to_reclaim, | 
					
						
							|  |  |  | 		.hibernation_mode = 1, | 
					
						
							|  |  |  | 		.order = 0, | 
					
						
							| 
									
										
										
										
											2012-05-29 15:06:57 -07:00
										 |  |  | 		.priority = DEF_PRIORITY, | 
					
						
							| 
									
										
										
										
											2005-04-16 15:20:36 -07:00
										 |  |  | 	}; | 
					
						
							| 
									
										
										
										
											2011-05-24 17:12:26 -07:00
										 |  |  | 	struct shrink_control shrink = { | 
					
						
							|  |  |  | 		.gfp_mask = sc.gfp_mask, | 
					
						
							|  |  |  | 	}; | 
					
						
							|  |  |  | 	struct zonelist *zonelist = node_zonelist(numa_node_id(), sc.gfp_mask); | 
					
						
							| 
									
										
											  
											
												vmscan: kill hibernation specific reclaim logic and unify it
shrink_all_zone() was introduced by commit d6277db4ab (swsusp: rework
memory shrinker) for hibernate performance improvement.  and
sc.swap_cluster_max was introduced by commit a06fe4d307 (Speed freeing
memory for suspend).
commit a06fe4d307 said
   Without the patch:
   Freed  14600 pages in  1749 jiffies = 32.61 MB/s (Anomolous!)
   Freed  88563 pages in 14719 jiffies = 23.50 MB/s
   Freed 205734 pages in 32389 jiffies = 24.81 MB/s
   With the patch:
   Freed  68252 pages in   496 jiffies = 537.52 MB/s
   Freed 116464 pages in   569 jiffies = 798.54 MB/s
   Freed 209699 pages in   705 jiffies = 1161.89 MB/s
At that time, their patch was pretty worth.  However, Modern Hardware
trend and recent VM improvement broke its worth.  From several reason, I
think we should remove shrink_all_zones() at all.
detail:
1) Old days, shrink_zone()'s slowness was mainly caused by stupid io-throttle
  at no i/o congestion.
  but current shrink_zone() is sane, not slow.
2) shrink_all_zone() try to shrink all pages at a time. but it doesn't works
  fine on numa system.
  example)
    System has 4GB memory and each node have 2GB. and hibernate need 1GB.
    optimal)
       steal 500MB from each node.
    shrink_all_zones)
       steal 1GB from node-0.
  Oh, Cache balancing logic was broken. ;)
  Unfortunately, Desktop system moved ahead NUMA at nowadays.
  (Side note, if hibernate require 2GB, shrink_all_zones() never success
   on above machine)
3) if the node has several I/O flighting pages, shrink_all_zones() makes
  pretty bad result.
  schenario) hibernate need 1GB
  1) shrink_all_zones() try to reclaim 1GB from Node-0
  2) but it only reclaimed 990MB
  3) stupidly, shrink_all_zones() try to reclaim 1GB from Node-1
  4) it reclaimed 990MB
  Oh, well. it reclaimed twice much than required.
  In the other hand, current shrink_zone() has sane baling out logic.
  then, it doesn't make overkill reclaim. then, we lost shrink_zones()'s risk.
4) SplitLRU VM always keep active/inactive ratio very carefully. inactive list only
  shrinking break its assumption. it makes unnecessary OOM risk. it obviously suboptimal.
Now, shrink_all_memory() is only the wrapper function of do_try_to_free_pages().
it bring good reviewability and debuggability, and solve above problems.
side note: Reclaim logic unificication makes two good side effect.
 - Fix recursive reclaim bug on shrink_all_memory().
   it did forgot to use PF_MEMALLOC. it mean the system be able to stuck into deadlock.
 - Now, shrink_all_memory() got lockdep awareness. it bring good debuggability.
Signed-off-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Reviewed-by: Rik van Riel <riel@redhat.com>
Acked-by: Rafael J. Wysocki <rjw@sisk.pl>
Cc: Mel Gorman <mel@csn.ul.ie>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
											
										 
											2009-12-14 17:59:12 -08:00
										 |  |  | 	struct task_struct *p = current; | 
					
						
							|  |  |  | 	unsigned long nr_reclaimed; | 
					
						
							| 
									
										
										
										
											2005-04-16 15:20:36 -07:00
										 |  |  | 
 | 
					
						
							| 
									
										
											  
											
												vmscan: kill hibernation specific reclaim logic and unify it
shrink_all_zone() was introduced by commit d6277db4ab (swsusp: rework
memory shrinker) for hibernate performance improvement.  and
sc.swap_cluster_max was introduced by commit a06fe4d307 (Speed freeing
memory for suspend).
commit a06fe4d307 said
   Without the patch:
   Freed  14600 pages in  1749 jiffies = 32.61 MB/s (Anomolous!)
   Freed  88563 pages in 14719 jiffies = 23.50 MB/s
   Freed 205734 pages in 32389 jiffies = 24.81 MB/s
   With the patch:
   Freed  68252 pages in   496 jiffies = 537.52 MB/s
   Freed 116464 pages in   569 jiffies = 798.54 MB/s
   Freed 209699 pages in   705 jiffies = 1161.89 MB/s
At that time, their patch was pretty worth.  However, Modern Hardware
trend and recent VM improvement broke its worth.  From several reason, I
think we should remove shrink_all_zones() at all.
detail:
1) Old days, shrink_zone()'s slowness was mainly caused by stupid io-throttle
  at no i/o congestion.
  but current shrink_zone() is sane, not slow.
2) shrink_all_zone() try to shrink all pages at a time. but it doesn't works
  fine on numa system.
  example)
    System has 4GB memory and each node have 2GB. and hibernate need 1GB.
    optimal)
       steal 500MB from each node.
    shrink_all_zones)
       steal 1GB from node-0.
  Oh, Cache balancing logic was broken. ;)
  Unfortunately, Desktop system moved ahead NUMA at nowadays.
  (Side note, if hibernate require 2GB, shrink_all_zones() never success
   on above machine)
3) if the node has several I/O flighting pages, shrink_all_zones() makes
  pretty bad result.
  schenario) hibernate need 1GB
  1) shrink_all_zones() try to reclaim 1GB from Node-0
  2) but it only reclaimed 990MB
  3) stupidly, shrink_all_zones() try to reclaim 1GB from Node-1
  4) it reclaimed 990MB
  Oh, well. it reclaimed twice much than required.
  In the other hand, current shrink_zone() has sane baling out logic.
  then, it doesn't make overkill reclaim. then, we lost shrink_zones()'s risk.
4) SplitLRU VM always keep active/inactive ratio very carefully. inactive list only
  shrinking break its assumption. it makes unnecessary OOM risk. it obviously suboptimal.
Now, shrink_all_memory() is only the wrapper function of do_try_to_free_pages().
it bring good reviewability and debuggability, and solve above problems.
side note: Reclaim logic unificication makes two good side effect.
 - Fix recursive reclaim bug on shrink_all_memory().
   it did forgot to use PF_MEMALLOC. it mean the system be able to stuck into deadlock.
 - Now, shrink_all_memory() got lockdep awareness. it bring good debuggability.
Signed-off-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Reviewed-by: Rik van Riel <riel@redhat.com>
Acked-by: Rafael J. Wysocki <rjw@sisk.pl>
Cc: Mel Gorman <mel@csn.ul.ie>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
											
										 
											2009-12-14 17:59:12 -08:00
										 |  |  | 	p->flags |= PF_MEMALLOC; | 
					
						
							|  |  |  | 	lockdep_set_current_reclaim_state(sc.gfp_mask); | 
					
						
							|  |  |  | 	reclaim_state.reclaimed_slab = 0; | 
					
						
							|  |  |  | 	p->reclaim_state = &reclaim_state; | 
					
						
							| 
									
										
										
										
											2006-06-23 02:03:18 -07:00
										 |  |  | 
 | 
					
						
							| 
									
										
										
										
											2011-05-24 17:12:26 -07:00
										 |  |  | 	nr_reclaimed = do_try_to_free_pages(zonelist, &sc, &shrink); | 
					
						
							| 
									
										
										
										
											2009-03-31 15:19:34 -07:00
										 |  |  | 
 | 
					
						
							| 
									
										
											  
											
												vmscan: kill hibernation specific reclaim logic and unify it
shrink_all_zone() was introduced by commit d6277db4ab (swsusp: rework
memory shrinker) for hibernate performance improvement.  and
sc.swap_cluster_max was introduced by commit a06fe4d307 (Speed freeing
memory for suspend).
commit a06fe4d307 said
   Without the patch:
   Freed  14600 pages in  1749 jiffies = 32.61 MB/s (Anomolous!)
   Freed  88563 pages in 14719 jiffies = 23.50 MB/s
   Freed 205734 pages in 32389 jiffies = 24.81 MB/s
   With the patch:
   Freed  68252 pages in   496 jiffies = 537.52 MB/s
   Freed 116464 pages in   569 jiffies = 798.54 MB/s
   Freed 209699 pages in   705 jiffies = 1161.89 MB/s
At that time, their patch was pretty worth.  However, Modern Hardware
trend and recent VM improvement broke its worth.  From several reason, I
think we should remove shrink_all_zones() at all.
detail:
1) Old days, shrink_zone()'s slowness was mainly caused by stupid io-throttle
  at no i/o congestion.
  but current shrink_zone() is sane, not slow.
2) shrink_all_zone() try to shrink all pages at a time. but it doesn't works
  fine on numa system.
  example)
    System has 4GB memory and each node have 2GB. and hibernate need 1GB.
    optimal)
       steal 500MB from each node.
    shrink_all_zones)
       steal 1GB from node-0.
  Oh, Cache balancing logic was broken. ;)
  Unfortunately, Desktop system moved ahead NUMA at nowadays.
  (Side note, if hibernate require 2GB, shrink_all_zones() never success
   on above machine)
3) if the node has several I/O flighting pages, shrink_all_zones() makes
  pretty bad result.
  schenario) hibernate need 1GB
  1) shrink_all_zones() try to reclaim 1GB from Node-0
  2) but it only reclaimed 990MB
  3) stupidly, shrink_all_zones() try to reclaim 1GB from Node-1
  4) it reclaimed 990MB
  Oh, well. it reclaimed twice much than required.
  In the other hand, current shrink_zone() has sane baling out logic.
  then, it doesn't make overkill reclaim. then, we lost shrink_zones()'s risk.
4) SplitLRU VM always keep active/inactive ratio very carefully. inactive list only
  shrinking break its assumption. it makes unnecessary OOM risk. it obviously suboptimal.
Now, shrink_all_memory() is only the wrapper function of do_try_to_free_pages().
it bring good reviewability and debuggability, and solve above problems.
side note: Reclaim logic unificication makes two good side effect.
 - Fix recursive reclaim bug on shrink_all_memory().
   it did forgot to use PF_MEMALLOC. it mean the system be able to stuck into deadlock.
 - Now, shrink_all_memory() got lockdep awareness. it bring good debuggability.
Signed-off-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Reviewed-by: Rik van Riel <riel@redhat.com>
Acked-by: Rafael J. Wysocki <rjw@sisk.pl>
Cc: Mel Gorman <mel@csn.ul.ie>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
											
										 
											2009-12-14 17:59:12 -08:00
										 |  |  | 	p->reclaim_state = NULL; | 
					
						
							|  |  |  | 	lockdep_clear_current_reclaim_state(); | 
					
						
							|  |  |  | 	p->flags &= ~PF_MEMALLOC; | 
					
						
							| 
									
										
										
										
											2006-06-23 02:03:18 -07:00
										 |  |  | 
 | 
					
						
							| 
									
										
											  
											
												vmscan: kill hibernation specific reclaim logic and unify it
shrink_all_zone() was introduced by commit d6277db4ab (swsusp: rework
memory shrinker) for hibernate performance improvement.  and
sc.swap_cluster_max was introduced by commit a06fe4d307 (Speed freeing
memory for suspend).
commit a06fe4d307 said
   Without the patch:
   Freed  14600 pages in  1749 jiffies = 32.61 MB/s (Anomolous!)
   Freed  88563 pages in 14719 jiffies = 23.50 MB/s
   Freed 205734 pages in 32389 jiffies = 24.81 MB/s
   With the patch:
   Freed  68252 pages in   496 jiffies = 537.52 MB/s
   Freed 116464 pages in   569 jiffies = 798.54 MB/s
   Freed 209699 pages in   705 jiffies = 1161.89 MB/s
At that time, their patch was pretty worth.  However, Modern Hardware
trend and recent VM improvement broke its worth.  From several reason, I
think we should remove shrink_all_zones() at all.
detail:
1) Old days, shrink_zone()'s slowness was mainly caused by stupid io-throttle
  at no i/o congestion.
  but current shrink_zone() is sane, not slow.
2) shrink_all_zone() try to shrink all pages at a time. but it doesn't works
  fine on numa system.
  example)
    System has 4GB memory and each node have 2GB. and hibernate need 1GB.
    optimal)
       steal 500MB from each node.
    shrink_all_zones)
       steal 1GB from node-0.
  Oh, Cache balancing logic was broken. ;)
  Unfortunately, Desktop system moved ahead NUMA at nowadays.
  (Side note, if hibernate require 2GB, shrink_all_zones() never success
   on above machine)
3) if the node has several I/O flighting pages, shrink_all_zones() makes
  pretty bad result.
  schenario) hibernate need 1GB
  1) shrink_all_zones() try to reclaim 1GB from Node-0
  2) but it only reclaimed 990MB
  3) stupidly, shrink_all_zones() try to reclaim 1GB from Node-1
  4) it reclaimed 990MB
  Oh, well. it reclaimed twice much than required.
  In the other hand, current shrink_zone() has sane baling out logic.
  then, it doesn't make overkill reclaim. then, we lost shrink_zones()'s risk.
4) SplitLRU VM always keep active/inactive ratio very carefully. inactive list only
  shrinking break its assumption. it makes unnecessary OOM risk. it obviously suboptimal.
Now, shrink_all_memory() is only the wrapper function of do_try_to_free_pages().
it bring good reviewability and debuggability, and solve above problems.
side note: Reclaim logic unificication makes two good side effect.
 - Fix recursive reclaim bug on shrink_all_memory().
   it did forgot to use PF_MEMALLOC. it mean the system be able to stuck into deadlock.
 - Now, shrink_all_memory() got lockdep awareness. it bring good debuggability.
Signed-off-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Reviewed-by: Rik van Riel <riel@redhat.com>
Acked-by: Rafael J. Wysocki <rjw@sisk.pl>
Cc: Mel Gorman <mel@csn.ul.ie>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
											
										 
											2009-12-14 17:59:12 -08:00
										 |  |  | 	return nr_reclaimed; | 
					
						
							| 
									
										
										
										
											2005-04-16 15:20:36 -07:00
										 |  |  | } | 
					
						
							| 
									
										
										
										
											2009-05-24 22:16:31 +02:00
										 |  |  | #endif /* CONFIG_HIBERNATION */
 | 
					
						
							| 
									
										
										
										
											2005-04-16 15:20:36 -07:00
										 |  |  | 
 | 
					
						
							|  |  |  | /* It's optimal to keep kswapds on the same CPUs as their memory, but
 | 
					
						
							|  |  |  |    not required for correctness.  So if the last cpu in a node goes | 
					
						
							|  |  |  |    away, we get changed to run anywhere: as the first one comes back, | 
					
						
							|  |  |  |    restore their cpu bindings. */ | 
					
						
							| 
									
										
										
										
											2012-12-21 15:01:06 -08:00
										 |  |  | static int cpu_callback(struct notifier_block *nfb, unsigned long action, | 
					
						
							|  |  |  | 			void *hcpu) | 
					
						
							| 
									
										
										
										
											2005-04-16 15:20:36 -07:00
										 |  |  | { | 
					
						
							| 
									
										
										
										
											2007-10-16 01:25:40 -07:00
										 |  |  | 	int nid; | 
					
						
							| 
									
										
										
										
											2005-04-16 15:20:36 -07:00
										 |  |  | 
 | 
					
						
							| 
									
										
										
										
											2007-05-09 02:35:10 -07:00
										 |  |  | 	if (action == CPU_ONLINE || action == CPU_ONLINE_FROZEN) { | 
					
						
							| 
									
										
										
										
											2012-12-12 13:51:43 -08:00
										 |  |  | 		for_each_node_state(nid, N_MEMORY) { | 
					
						
							| 
									
										
										
										
											2008-04-04 18:11:10 -07:00
										 |  |  | 			pg_data_t *pgdat = NODE_DATA(nid); | 
					
						
							| 
									
										
										
										
											2009-03-13 14:49:46 +10:30
										 |  |  | 			const struct cpumask *mask; | 
					
						
							|  |  |  | 
 | 
					
						
							|  |  |  | 			mask = cpumask_of_node(pgdat->node_id); | 
					
						
							| 
									
										
										
										
											2008-04-04 18:11:10 -07:00
										 |  |  | 
 | 
					
						
							| 
									
										
										
										
											2009-01-01 10:12:24 +10:30
										 |  |  | 			if (cpumask_any_and(cpu_online_mask, mask) < nr_cpu_ids) | 
					
						
							| 
									
										
										
										
											2005-04-16 15:20:36 -07:00
										 |  |  | 				/* One of our CPUs online: restore mask */ | 
					
						
							| 
									
										
										
										
											2008-04-04 18:11:10 -07:00
										 |  |  | 				set_cpus_allowed_ptr(pgdat->kswapd, mask); | 
					
						
							| 
									
										
										
										
											2005-04-16 15:20:36 -07:00
										 |  |  | 		} | 
					
						
							|  |  |  | 	} | 
					
						
							|  |  |  | 	return NOTIFY_OK; | 
					
						
							|  |  |  | } | 
					
						
							|  |  |  | 
 | 
					
						
							| 
									
										
										
										
											2006-06-27 02:53:33 -07:00
										 |  |  | /*
 | 
					
						
							|  |  |  |  * This kswapd start function will be called by init and node-hot-add. | 
					
						
							|  |  |  |  * On node-hot-add, kswapd will moved to proper cpus if cpus are hot-added. | 
					
						
							|  |  |  |  */ | 
					
						
							|  |  |  | int kswapd_run(int nid) | 
					
						
							|  |  |  | { | 
					
						
							|  |  |  | 	pg_data_t *pgdat = NODE_DATA(nid); | 
					
						
							|  |  |  | 	int ret = 0; | 
					
						
							|  |  |  | 
 | 
					
						
							|  |  |  | 	if (pgdat->kswapd) | 
					
						
							|  |  |  | 		return 0; | 
					
						
							|  |  |  | 
 | 
					
						
							|  |  |  | 	pgdat->kswapd = kthread_run(kswapd, pgdat, "kswapd%d", nid); | 
					
						
							|  |  |  | 	if (IS_ERR(pgdat->kswapd)) { | 
					
						
							|  |  |  | 		/* failure at boot is fatal */ | 
					
						
							|  |  |  | 		BUG_ON(system_state == SYSTEM_BOOTING); | 
					
						
							| 
									
										
										
										
											2012-10-08 16:29:27 -07:00
										 |  |  | 		pr_err("Failed to start kswapd on node %d\n", nid); | 
					
						
							|  |  |  | 		ret = PTR_ERR(pgdat->kswapd); | 
					
						
							| 
									
										
										
										
											2013-04-17 15:58:34 -07:00
										 |  |  | 		pgdat->kswapd = NULL; | 
					
						
							| 
									
										
										
										
											2006-06-27 02:53:33 -07:00
										 |  |  | 	} | 
					
						
							|  |  |  | 	return ret; | 
					
						
							|  |  |  | } | 
					
						
							|  |  |  | 
 | 
					
						
							| 
									
										
										
										
											2009-12-14 17:58:33 -08:00
										 |  |  | /*
 | 
					
						
							| 
									
										
										
										
											2012-07-11 14:01:52 -07:00
										 |  |  |  * Called by memory hotplug when all memory in a node is offlined.  Caller must | 
					
						
							|  |  |  |  * hold lock_memory_hotplug(). | 
					
						
							| 
									
										
										
										
											2009-12-14 17:58:33 -08:00
										 |  |  |  */ | 
					
						
							|  |  |  | void kswapd_stop(int nid) | 
					
						
							|  |  |  | { | 
					
						
							|  |  |  | 	struct task_struct *kswapd = NODE_DATA(nid)->kswapd; | 
					
						
							|  |  |  | 
 | 
					
						
							| 
									
										
										
										
											2012-07-11 14:01:52 -07:00
										 |  |  | 	if (kswapd) { | 
					
						
							| 
									
										
										
										
											2009-12-14 17:58:33 -08:00
										 |  |  | 		kthread_stop(kswapd); | 
					
						
							| 
									
										
										
										
											2012-07-11 14:01:52 -07:00
										 |  |  | 		NODE_DATA(nid)->kswapd = NULL; | 
					
						
							|  |  |  | 	} | 
					
						
							| 
									
										
										
										
											2009-12-14 17:58:33 -08:00
										 |  |  | } | 
					
						
							|  |  |  | 
 | 
					
						
							| 
									
										
										
										
											2005-04-16 15:20:36 -07:00
										 |  |  | static int __init kswapd_init(void) | 
					
						
							|  |  |  | { | 
					
						
							| 
									
										
										
										
											2006-06-27 02:53:33 -07:00
										 |  |  | 	int nid; | 
					
						
							| 
									
										
										
										
											2006-03-22 00:08:19 -08:00
										 |  |  | 
 | 
					
						
							| 
									
										
										
										
											2005-04-16 15:20:36 -07:00
										 |  |  | 	swap_setup(); | 
					
						
							| 
									
										
										
										
											2012-12-12 13:51:43 -08:00
										 |  |  | 	for_each_node_state(nid, N_MEMORY) | 
					
						
							| 
									
										
										
										
											2006-06-27 02:53:33 -07:00
										 |  |  |  		kswapd_run(nid); | 
					
						
							| 
									
										
										
										
											2005-04-16 15:20:36 -07:00
										 |  |  | 	hotcpu_notifier(cpu_callback, 0); | 
					
						
							|  |  |  | 	return 0; | 
					
						
							|  |  |  | } | 
					
						
							|  |  |  | 
 | 
					
						
							|  |  |  | module_init(kswapd_init) | 
					
						
							| 
									
										
										
										
											2006-01-18 17:42:31 -08:00
										 |  |  | 
 | 
					
						
							|  |  |  | #ifdef CONFIG_NUMA
 | 
					
						
							|  |  |  | /*
 | 
					
						
							|  |  |  |  * Zone reclaim mode | 
					
						
							|  |  |  |  * | 
					
						
							|  |  |  |  * If non-zero call zone_reclaim when the number of free pages falls below | 
					
						
							|  |  |  |  * the watermarks. | 
					
						
							|  |  |  |  */ | 
					
						
							|  |  |  | int zone_reclaim_mode __read_mostly; | 
					
						
							|  |  |  | 
 | 
					
						
							| 
									
										
										
										
											2006-02-01 03:05:34 -08:00
										 |  |  | #define RECLAIM_OFF 0
 | 
					
						
							| 
									
										
										
										
											2008-07-29 22:33:41 -07:00
										 |  |  | #define RECLAIM_ZONE (1<<0)	/* Run shrink_inactive_list on the zone */
 | 
					
						
							| 
									
										
										
										
											2006-02-01 03:05:34 -08:00
										 |  |  | #define RECLAIM_WRITE (1<<1)	/* Writeout pages during reclaim */
 | 
					
						
							|  |  |  | #define RECLAIM_SWAP (1<<2)	/* Swap pages out during reclaim */
 | 
					
						
							|  |  |  | 
 | 
					
						
							| 
									
										
										
										
											2006-02-01 03:05:32 -08:00
										 |  |  | /*
 | 
					
						
							|  |  |  |  * Priority for ZONE_RECLAIM. This determines the fraction of pages | 
					
						
							|  |  |  |  * of a node considered for each zone_reclaim. 4 scans 1/16th of | 
					
						
							|  |  |  |  * a zone. | 
					
						
							|  |  |  |  */ | 
					
						
							|  |  |  | #define ZONE_RECLAIM_PRIORITY 4
 | 
					
						
							|  |  |  | 
 | 
					
						
							| 
									
										
										
										
											2006-07-03 00:24:13 -07:00
										 |  |  | /*
 | 
					
						
							|  |  |  |  * Percentage of pages in a zone that must be unmapped for zone_reclaim to | 
					
						
							|  |  |  |  * occur. | 
					
						
							|  |  |  |  */ | 
					
						
							|  |  |  | int sysctl_min_unmapped_ratio = 1; | 
					
						
							|  |  |  | 
 | 
					
						
							| 
									
										
										
										
											2006-09-25 23:31:52 -07:00
										 |  |  | /*
 | 
					
						
							|  |  |  |  * If the number of slab pages in a zone grows beyond this percentage then | 
					
						
							|  |  |  |  * slab reclaim needs to occur. | 
					
						
							|  |  |  |  */ | 
					
						
							|  |  |  | int sysctl_min_slab_ratio = 5; | 
					
						
							|  |  |  | 
 | 
					
						
							| 
									
										
											  
											
												vmscan: properly account for the number of page cache pages zone_reclaim() can reclaim
A bug was brought to my attention against a distro kernel but it affects
mainline and I believe problems like this have been reported in various
guises on the mailing lists although I don't have specific examples at the
moment.
The reported problem was that malloc() stalled for a long time (minutes in
some cases) if a large tmpfs mount was occupying a large percentage of
memory overall.  The pages did not get cleaned or reclaimed by
zone_reclaim() because the zone_reclaim_mode was unsuitable, but the lists
are uselessly scanned frequencly making the CPU spin at near 100%.
This patchset intends to address that bug and bring the behaviour of
zone_reclaim() more in line with expectations which were noticed during
investigation.  It is based on top of mmotm and takes advantage of
Kosaki's work with respect to zone_reclaim().
Patch 1 fixes the heuristics that zone_reclaim() uses to determine if the
	scan should go ahead. The broken heuristic is what was causing the
	malloc() stall as it uselessly scanned the LRU constantly. Currently,
	zone_reclaim is assuming zone_reclaim_mode is 1 and historically it
	could not deal with tmpfs pages at all. This fixes up the heuristic so
	that an unnecessary scan is more likely to be correctly avoided.
Patch 2 notes that zone_reclaim() returning a failure automatically means
	the zone is marked full. This is not always true. It could have
	failed because the GFP mask or zone_reclaim_mode were unsuitable.
Patch 3 introduces a counter zreclaim_failed that will increment each
	time the zone_reclaim scan-avoidance heuristics fail. If that
	counter is rapidly increasing, then zone_reclaim_mode should be
	set to 0 as a temporarily resolution and a bug reported because
	the scan-avoidance heuristic is still broken.
This patch:
On NUMA machines, the administrator can configure zone_reclaim_mode that
is a more targetted form of direct reclaim.  On machines with large NUMA
distances for example, a zone_reclaim_mode defaults to 1 meaning that
clean unmapped pages will be reclaimed if the zone watermarks are not
being met.
There is a heuristic that determines if the scan is worthwhile but the
problem is that the heuristic is not being properly applied and is
basically assuming zone_reclaim_mode is 1 if it is enabled.  The lack of
proper detection can manfiest as high CPU usage as the LRU list is scanned
uselessly.
Historically, once enabled it was depending on NR_FILE_PAGES which may
include swapcache pages that the reclaim_mode cannot deal with.  Patch
vmscan-change-the-number-of-the-unmapped-files-in-zone-reclaim.patch by
Kosaki Motohiro noted that zone_page_state(zone, NR_FILE_PAGES) included
pages that were not file-backed such as swapcache and made a calculation
based on the inactive, active and mapped files.  This is far superior when
zone_reclaim==1 but if RECLAIM_SWAP is set, then NR_FILE_PAGES is a
reasonable starting figure.
This patch alters how zone_reclaim() works out how many pages it might be
able to reclaim given the current reclaim_mode.  If RECLAIM_SWAP is set in
the reclaim_mode it will either consider NR_FILE_PAGES as potential
candidates or else use NR_{IN}ACTIVE}_PAGES-NR_FILE_MAPPED to discount
swapcache and other non-file-backed pages.  If RECLAIM_WRITE is not set,
then NR_FILE_DIRTY number of pages are not candidates.  If RECLAIM_SWAP is
not set, then NR_FILE_MAPPED are not.
[kosaki.motohiro@jp.fujitsu.com: Estimate unmapped pages minus tmpfs pages]
[fengguang.wu@intel.com: Fix underflow problem in Kosaki's estimate]
Signed-off-by: Mel Gorman <mel@csn.ul.ie>
Reviewed-by: Rik van Riel <riel@redhat.com>
Acked-by: Christoph Lameter <cl@linux-foundation.org>
Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Cc: Wu Fengguang <fengguang.wu@intel.com>
Cc: <stable@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
											
										 
											2009-06-16 15:33:20 -07:00
										 |  |  | static inline unsigned long zone_unmapped_file_pages(struct zone *zone) | 
					
						
							|  |  |  | { | 
					
						
							|  |  |  | 	unsigned long file_mapped = zone_page_state(zone, NR_FILE_MAPPED); | 
					
						
							|  |  |  | 	unsigned long file_lru = zone_page_state(zone, NR_INACTIVE_FILE) + | 
					
						
							|  |  |  | 		zone_page_state(zone, NR_ACTIVE_FILE); | 
					
						
							|  |  |  | 
 | 
					
						
							|  |  |  | 	/*
 | 
					
						
							|  |  |  | 	 * It's possible for there to be more file mapped pages than | 
					
						
							|  |  |  | 	 * accounted for by the pages on the file LRU lists because | 
					
						
							|  |  |  | 	 * tmpfs pages accounted for as ANON can also be FILE_MAPPED | 
					
						
							|  |  |  | 	 */ | 
					
						
							|  |  |  | 	return (file_lru > file_mapped) ? (file_lru - file_mapped) : 0; | 
					
						
							|  |  |  | } | 
					
						
							|  |  |  | 
 | 
					
						
							|  |  |  | /* Work out how many page cache pages we can reclaim in this reclaim_mode */ | 
					
						
							|  |  |  | static long zone_pagecache_reclaimable(struct zone *zone) | 
					
						
							|  |  |  | { | 
					
						
							|  |  |  | 	long nr_pagecache_reclaimable; | 
					
						
							|  |  |  | 	long delta = 0; | 
					
						
							|  |  |  | 
 | 
					
						
							|  |  |  | 	/*
 | 
					
						
							|  |  |  | 	 * If RECLAIM_SWAP is set, then all file pages are considered | 
					
						
							|  |  |  | 	 * potentially reclaimable. Otherwise, we have to worry about | 
					
						
							|  |  |  | 	 * pages like swapcache and zone_unmapped_file_pages() provides | 
					
						
							|  |  |  | 	 * a better estimate | 
					
						
							|  |  |  | 	 */ | 
					
						
							|  |  |  | 	if (zone_reclaim_mode & RECLAIM_SWAP) | 
					
						
							|  |  |  | 		nr_pagecache_reclaimable = zone_page_state(zone, NR_FILE_PAGES); | 
					
						
							|  |  |  | 	else | 
					
						
							|  |  |  | 		nr_pagecache_reclaimable = zone_unmapped_file_pages(zone); | 
					
						
							|  |  |  | 
 | 
					
						
							|  |  |  | 	/* If we can't clean pages, remove dirty pages from consideration */ | 
					
						
							|  |  |  | 	if (!(zone_reclaim_mode & RECLAIM_WRITE)) | 
					
						
							|  |  |  | 		delta += zone_page_state(zone, NR_FILE_DIRTY); | 
					
						
							|  |  |  | 
 | 
					
						
							|  |  |  | 	/* Watch for any possible underflows due to delta */ | 
					
						
							|  |  |  | 	if (unlikely(delta > nr_pagecache_reclaimable)) | 
					
						
							|  |  |  | 		delta = nr_pagecache_reclaimable; | 
					
						
							|  |  |  | 
 | 
					
						
							|  |  |  | 	return nr_pagecache_reclaimable - delta; | 
					
						
							|  |  |  | } | 
					
						
							|  |  |  | 
 | 
					
						
							| 
									
										
										
										
											2006-01-18 17:42:31 -08:00
										 |  |  | /*
 | 
					
						
							|  |  |  |  * Try to free up some pages from this zone through reclaim. | 
					
						
							|  |  |  |  */ | 
					
						
							| 
									
										
										
										
											2006-03-22 00:08:18 -08:00
										 |  |  | static int __zone_reclaim(struct zone *zone, gfp_t gfp_mask, unsigned int order) | 
					
						
							| 
									
										
										
										
											2006-01-18 17:42:31 -08:00
										 |  |  | { | 
					
						
							| 
									
										
										
										
											2006-03-22 00:08:22 -08:00
										 |  |  | 	/* Minimum pages needed in order to stay on node */ | 
					
						
							| 
									
										
										
										
											2006-03-22 00:08:19 -08:00
										 |  |  | 	const unsigned long nr_pages = 1 << order; | 
					
						
							| 
									
										
										
										
											2006-01-18 17:42:31 -08:00
										 |  |  | 	struct task_struct *p = current; | 
					
						
							|  |  |  | 	struct reclaim_state reclaim_state; | 
					
						
							| 
									
										
										
										
											2006-03-22 00:08:18 -08:00
										 |  |  | 	struct scan_control sc = { | 
					
						
							|  |  |  | 		.may_writepage = !!(zone_reclaim_mode & RECLAIM_WRITE), | 
					
						
							| 
									
										
										
										
											2009-03-31 15:19:30 -07:00
										 |  |  | 		.may_unmap = !!(zone_reclaim_mode & RECLAIM_SWAP), | 
					
						
							| 
									
										
										
										
											2009-04-21 12:24:57 -07:00
										 |  |  | 		.may_swap = 1, | 
					
						
							| 
									
										
										
										
											2013-02-22 16:32:24 -08:00
										 |  |  | 		.nr_to_reclaim = max(nr_pages, SWAP_CLUSTER_MAX), | 
					
						
							| 
									
										
											  
											
												mm: teach mm by current context info to not do I/O during memory allocation
This patch introduces PF_MEMALLOC_NOIO on process flag('flags' field of
'struct task_struct'), so that the flag can be set by one task to avoid
doing I/O inside memory allocation in the task's context.
The patch trys to solve one deadlock problem caused by block device, and
the problem may happen at least in the below situations:
- during block device runtime resume, if memory allocation with
  GFP_KERNEL is called inside runtime resume callback of any one of its
  ancestors(or the block device itself), the deadlock may be triggered
  inside the memory allocation since it might not complete until the block
  device becomes active and the involed page I/O finishes.  The situation
  is pointed out first by Alan Stern.  It is not a good approach to
  convert all GFP_KERNEL[1] in the path into GFP_NOIO because several
  subsystems may be involved(for example, PCI, USB and SCSI may be
  involved for usb mass stoarage device, network devices involved too in
  the iSCSI case)
- during block device runtime suspend, because runtime resume need to
  wait for completion of concurrent runtime suspend.
- during error handling of usb mass storage deivce, USB bus reset will
  be put on the device, so there shouldn't have any memory allocation with
  GFP_KERNEL during USB bus reset, otherwise the deadlock similar with
  above may be triggered.  Unfortunately, any usb device may include one
  mass storage interface in theory, so it requires all usb interface
  drivers to handle the situation.  In fact, most usb drivers don't know
  how to handle bus reset on the device and don't provide .pre_set() and
  .post_reset() callback at all, so USB core has to unbind and bind driver
  for these devices.  So it is still not practical to resort to GFP_NOIO
  for solving the problem.
Also the introduced solution can be used by block subsystem or block
drivers too, for example, set the PF_MEMALLOC_NOIO flag before doing
actual I/O transfer.
It is not a good idea to convert all these GFP_KERNEL in the affected
path into GFP_NOIO because these functions doing that may be implemented
as library and will be called in many other contexts.
In fact, memalloc_noio_flags() can convert some of current static
GFP_NOIO allocation into GFP_KERNEL back in other non-affected contexts,
at least almost all GFP_NOIO in USB subsystem can be converted into
GFP_KERNEL after applying the approach and make allocation with GFP_NOIO
only happen in runtime resume/bus reset/block I/O transfer contexts
generally.
[1], several GFP_KERNEL allocation examples in runtime resume path
- pci subsystem
acpi_os_allocate
	<-acpi_ut_allocate
		<-ACPI_ALLOCATE_ZEROED
			<-acpi_evaluate_object
				<-__acpi_bus_set_power
					<-acpi_bus_set_power
						<-acpi_pci_set_power_state
							<-platform_pci_set_power_state
								<-pci_platform_power_transition
									<-__pci_complete_power_transition
										<-pci_set_power_state
											<-pci_restore_standard_config
												<-pci_pm_runtime_resume
- usb subsystem
usb_get_status
	<-finish_port_resume
		<-usb_port_resume
			<-generic_resume
				<-usb_resume_device
					<-usb_resume_both
						<-usb_runtime_resume
- some individual usb drivers
usblp, uvc, gspca, most of dvb-usb-v2 media drivers, cpia2, az6007, ....
That is just what I have found.  Unfortunately, this allocation can only
be found by human being now, and there should be many not found since
any function in the resume path(call tree) may allocate memory with
GFP_KERNEL.
Signed-off-by: Ming Lei <ming.lei@canonical.com>
Signed-off-by: Minchan Kim <minchan@kernel.org>
Cc: Alan Stern <stern@rowland.harvard.edu>
Cc: Oliver Neukum <oneukum@suse.de>
Cc: Jiri Kosina <jiri.kosina@suse.com>
Cc: Mel Gorman <mel@csn.ul.ie>
Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Michal Hocko <mhocko@suse.cz>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: "Rafael J. Wysocki" <rjw@sisk.pl>
Cc: Greg KH <greg@kroah.com>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: "David S. Miller" <davem@davemloft.net>
Cc: Eric Dumazet <eric.dumazet@gmail.com>
Cc: David Decotigny <david.decotigny@google.com>
Cc: Tom Herbert <therbert@google.com>
Cc: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
											
										 
											2013-02-22 16:34:08 -08:00
										 |  |  | 		.gfp_mask = (gfp_mask = memalloc_noio_flags(gfp_mask)), | 
					
						
							| 
									
										
										
										
											2009-03-31 15:19:38 -07:00
										 |  |  | 		.order = order, | 
					
						
							| 
									
										
										
										
											2012-05-29 15:06:57 -07:00
										 |  |  | 		.priority = ZONE_RECLAIM_PRIORITY, | 
					
						
							| 
									
										
										
										
											2006-03-22 00:08:18 -08:00
										 |  |  | 	}; | 
					
						
							| 
									
										
										
										
											2011-05-24 17:12:26 -07:00
										 |  |  | 	struct shrink_control shrink = { | 
					
						
							|  |  |  | 		.gfp_mask = sc.gfp_mask, | 
					
						
							|  |  |  | 	}; | 
					
						
							| 
									
										
										
										
											2010-08-09 17:19:50 -07:00
										 |  |  | 	unsigned long nr_slab_pages0, nr_slab_pages1; | 
					
						
							| 
									
										
										
										
											2006-01-18 17:42:31 -08:00
										 |  |  | 
 | 
					
						
							|  |  |  | 	cond_resched(); | 
					
						
							| 
									
										
										
										
											2006-02-24 13:04:22 -08:00
										 |  |  | 	/*
 | 
					
						
							|  |  |  | 	 * We need to be able to allocate from the reserves for RECLAIM_SWAP | 
					
						
							|  |  |  | 	 * and we also need to be able to write out pages for RECLAIM_WRITE | 
					
						
							|  |  |  | 	 * and RECLAIM_SWAP. | 
					
						
							|  |  |  | 	 */ | 
					
						
							|  |  |  | 	p->flags |= PF_MEMALLOC | PF_SWAPWRITE; | 
					
						
							| 
									
										
										
										
											2010-03-05 13:41:47 -08:00
										 |  |  | 	lockdep_set_current_reclaim_state(gfp_mask); | 
					
						
							| 
									
										
										
										
											2006-01-18 17:42:31 -08:00
										 |  |  | 	reclaim_state.reclaimed_slab = 0; | 
					
						
							|  |  |  | 	p->reclaim_state = &reclaim_state; | 
					
						
							| 
									
										
										
										
											2006-02-01 03:05:29 -08:00
										 |  |  | 
 | 
					
						
							| 
									
										
											  
											
												vmscan: properly account for the number of page cache pages zone_reclaim() can reclaim
A bug was brought to my attention against a distro kernel but it affects
mainline and I believe problems like this have been reported in various
guises on the mailing lists although I don't have specific examples at the
moment.
The reported problem was that malloc() stalled for a long time (minutes in
some cases) if a large tmpfs mount was occupying a large percentage of
memory overall.  The pages did not get cleaned or reclaimed by
zone_reclaim() because the zone_reclaim_mode was unsuitable, but the lists
are uselessly scanned frequencly making the CPU spin at near 100%.
This patchset intends to address that bug and bring the behaviour of
zone_reclaim() more in line with expectations which were noticed during
investigation.  It is based on top of mmotm and takes advantage of
Kosaki's work with respect to zone_reclaim().
Patch 1 fixes the heuristics that zone_reclaim() uses to determine if the
	scan should go ahead. The broken heuristic is what was causing the
	malloc() stall as it uselessly scanned the LRU constantly. Currently,
	zone_reclaim is assuming zone_reclaim_mode is 1 and historically it
	could not deal with tmpfs pages at all. This fixes up the heuristic so
	that an unnecessary scan is more likely to be correctly avoided.
Patch 2 notes that zone_reclaim() returning a failure automatically means
	the zone is marked full. This is not always true. It could have
	failed because the GFP mask or zone_reclaim_mode were unsuitable.
Patch 3 introduces a counter zreclaim_failed that will increment each
	time the zone_reclaim scan-avoidance heuristics fail. If that
	counter is rapidly increasing, then zone_reclaim_mode should be
	set to 0 as a temporarily resolution and a bug reported because
	the scan-avoidance heuristic is still broken.
This patch:
On NUMA machines, the administrator can configure zone_reclaim_mode that
is a more targetted form of direct reclaim.  On machines with large NUMA
distances for example, a zone_reclaim_mode defaults to 1 meaning that
clean unmapped pages will be reclaimed if the zone watermarks are not
being met.
There is a heuristic that determines if the scan is worthwhile but the
problem is that the heuristic is not being properly applied and is
basically assuming zone_reclaim_mode is 1 if it is enabled.  The lack of
proper detection can manfiest as high CPU usage as the LRU list is scanned
uselessly.
Historically, once enabled it was depending on NR_FILE_PAGES which may
include swapcache pages that the reclaim_mode cannot deal with.  Patch
vmscan-change-the-number-of-the-unmapped-files-in-zone-reclaim.patch by
Kosaki Motohiro noted that zone_page_state(zone, NR_FILE_PAGES) included
pages that were not file-backed such as swapcache and made a calculation
based on the inactive, active and mapped files.  This is far superior when
zone_reclaim==1 but if RECLAIM_SWAP is set, then NR_FILE_PAGES is a
reasonable starting figure.
This patch alters how zone_reclaim() works out how many pages it might be
able to reclaim given the current reclaim_mode.  If RECLAIM_SWAP is set in
the reclaim_mode it will either consider NR_FILE_PAGES as potential
candidates or else use NR_{IN}ACTIVE}_PAGES-NR_FILE_MAPPED to discount
swapcache and other non-file-backed pages.  If RECLAIM_WRITE is not set,
then NR_FILE_DIRTY number of pages are not candidates.  If RECLAIM_SWAP is
not set, then NR_FILE_MAPPED are not.
[kosaki.motohiro@jp.fujitsu.com: Estimate unmapped pages minus tmpfs pages]
[fengguang.wu@intel.com: Fix underflow problem in Kosaki's estimate]
Signed-off-by: Mel Gorman <mel@csn.ul.ie>
Reviewed-by: Rik van Riel <riel@redhat.com>
Acked-by: Christoph Lameter <cl@linux-foundation.org>
Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Cc: Wu Fengguang <fengguang.wu@intel.com>
Cc: <stable@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
											
										 
											2009-06-16 15:33:20 -07:00
										 |  |  | 	if (zone_pagecache_reclaimable(zone) > zone->min_unmapped_pages) { | 
					
						
							| 
									
										
										
										
											2006-09-25 23:31:52 -07:00
										 |  |  | 		/*
 | 
					
						
							|  |  |  | 		 * Free memory by calling shrink zone with increasing | 
					
						
							|  |  |  | 		 * priorities until we have enough memory freed. | 
					
						
							|  |  |  | 		 */ | 
					
						
							|  |  |  | 		do { | 
					
						
							| 
									
										
										
										
											2012-05-29 15:06:57 -07:00
										 |  |  | 			shrink_zone(zone, &sc); | 
					
						
							|  |  |  | 		} while (sc.nr_reclaimed < nr_pages && --sc.priority >= 0); | 
					
						
							| 
									
										
										
										
											2006-09-25 23:31:52 -07:00
										 |  |  | 	} | 
					
						
							| 
									
										
										
										
											2006-02-01 03:05:29 -08:00
										 |  |  | 
 | 
					
						
							| 
									
										
										
										
											2010-08-09 17:19:50 -07:00
										 |  |  | 	nr_slab_pages0 = zone_page_state(zone, NR_SLAB_RECLAIMABLE); | 
					
						
							|  |  |  | 	if (nr_slab_pages0 > zone->min_slab_pages) { | 
					
						
							| 
									
										
										
										
											2006-02-01 03:05:35 -08:00
										 |  |  | 		/*
 | 
					
						
							| 
									
										
										
										
											2006-03-22 00:08:22 -08:00
										 |  |  | 		 * shrink_slab() does not currently allow us to determine how | 
					
						
							| 
									
										
										
										
											2006-09-25 23:31:52 -07:00
										 |  |  | 		 * many pages were freed in this zone. So we take the current | 
					
						
							|  |  |  | 		 * number of slab pages and shake the slab until it is reduced | 
					
						
							|  |  |  | 		 * by the same nr_pages that we used for reclaiming unmapped | 
					
						
							|  |  |  | 		 * pages. | 
					
						
							| 
									
										
										
										
											2006-02-01 03:05:35 -08:00
										 |  |  | 		 */ | 
					
						
							| 
									
										
										
										
											2013-08-28 10:18:03 +10:00
										 |  |  | 		nodes_clear(shrink.nodes_to_scan); | 
					
						
							|  |  |  | 		node_set(zone_to_nid(zone), shrink.nodes_to_scan); | 
					
						
							| 
									
										
										
										
											2010-08-09 17:19:54 -07:00
										 |  |  | 		for (;;) { | 
					
						
							|  |  |  | 			unsigned long lru_pages = zone_reclaimable_pages(zone); | 
					
						
							|  |  |  | 
 | 
					
						
							|  |  |  | 			/* No reclaimable slab or very low memory pressure */ | 
					
						
							| 
									
										
										
										
											2011-05-24 17:12:27 -07:00
										 |  |  | 			if (!shrink_slab(&shrink, sc.nr_scanned, lru_pages)) | 
					
						
							| 
									
										
										
										
											2010-08-09 17:19:54 -07:00
										 |  |  | 				break; | 
					
						
							|  |  |  | 
 | 
					
						
							|  |  |  | 			/* Freed enough memory */ | 
					
						
							|  |  |  | 			nr_slab_pages1 = zone_page_state(zone, | 
					
						
							|  |  |  | 							NR_SLAB_RECLAIMABLE); | 
					
						
							|  |  |  | 			if (nr_slab_pages1 + nr_pages <= nr_slab_pages0) | 
					
						
							|  |  |  | 				break; | 
					
						
							|  |  |  | 		} | 
					
						
							| 
									
										
										
										
											2006-09-25 23:31:53 -07:00
										 |  |  | 
 | 
					
						
							|  |  |  | 		/*
 | 
					
						
							|  |  |  | 		 * Update nr_reclaimed by the number of slab pages we | 
					
						
							|  |  |  | 		 * reclaimed from this zone. | 
					
						
							|  |  |  | 		 */ | 
					
						
							| 
									
										
										
										
											2010-08-09 17:19:50 -07:00
										 |  |  | 		nr_slab_pages1 = zone_page_state(zone, NR_SLAB_RECLAIMABLE); | 
					
						
							|  |  |  | 		if (nr_slab_pages1 < nr_slab_pages0) | 
					
						
							|  |  |  | 			sc.nr_reclaimed += nr_slab_pages0 - nr_slab_pages1; | 
					
						
							| 
									
										
										
										
											2006-02-01 03:05:35 -08:00
										 |  |  | 	} | 
					
						
							|  |  |  | 
 | 
					
						
							| 
									
										
										
										
											2006-01-18 17:42:31 -08:00
										 |  |  | 	p->reclaim_state = NULL; | 
					
						
							| 
									
										
										
										
											2006-02-24 13:04:22 -08:00
										 |  |  | 	current->flags &= ~(PF_MEMALLOC | PF_SWAPWRITE); | 
					
						
							| 
									
										
										
										
											2010-03-05 13:41:47 -08:00
										 |  |  | 	lockdep_clear_current_reclaim_state(); | 
					
						
							| 
									
										
											  
											
												vmscan: bail out of direct reclaim after swap_cluster_max pages
When the VM is under pressure, it can happen that several direct reclaim
processes are in the pageout code simultaneously.  It also happens that
the reclaiming processes run into mostly referenced, mapped and dirty
pages in the first round.
This results in multiple direct reclaim processes having a lower
pageout priority, which corresponds to a higher target of pages to
scan.
This in turn can result in each direct reclaim process freeing
many pages.  Together, they can end up freeing way too many pages.
This kicks useful data out of memory (in some cases more than half
of all memory is swapped out).  It also impacts performance by
keeping tasks stuck in the pageout code for too long.
A 30% improvement in hackbench has been observed with this patch.
The fix is relatively simple: in shrink_zone() we can check how many
pages we have already freed, direct reclaim tasks break out of the
scanning loop if they have already freed enough pages and have reached
a lower priority level.
We do not break out of shrink_zone() when priority == DEF_PRIORITY,
to ensure that equal pressure is applied to every zone in the common
case.
However, in order to do this we do need to know how many pages we already
freed, so move nr_reclaimed into scan_control.
akpm: a historical interlude...
We tried this in 2004:
:commit e468e46a9bea3297011d5918663ce6d19094cf87
:Author: akpm <akpm>
:Date:   Thu Jun 24 15:53:52 2004 +0000
:
:[PATCH] vmscan.c: dont reclaim too many pages
:
:    The shrink_zone() logic can, under some circumstances, cause far too many
:    pages to be reclaimed.  Say, we're scanning at high priority and suddenly hit
:    a large number of reclaimable pages on the LRU.
:    Change things so we bale out when SWAP_CLUSTER_MAX pages have been reclaimed.
And we reverted it in 2006:
:commit 210fe530305ee50cd889fe9250168228b2994f32
:Author: Andrew Morton <akpm@osdl.org>
:Date:   Fri Jan 6 00:11:14 2006 -0800
:
:    [PATCH] vmscan: balancing fix
:
:    Revert a patch which went into 2.6.8-rc1.  The changelog for that patch was:
:
:      The shrink_zone() logic can, under some circumstances, cause far too many
:      pages to be reclaimed.  Say, we're scanning at high priority and suddenly
:      hit a large number of reclaimable pages on the LRU.
:
:      Change things so we bale out when SWAP_CLUSTER_MAX pages have been
:      reclaimed.
:
:    Problem is, this change caused significant imbalance in inter-zone scan
:    balancing by truncating scans of larger zones.
:
:    Suppose, for example, ZONE_HIGHMEM is 10x the size of ZONE_NORMAL.  The zone
:    balancing algorithm would require that if we're scanning 100 pages of
:    ZONE_HIGHMEM, we should scan 10 pages of ZONE_NORMAL.  But this logic will
:    cause the scanning of ZONE_HIGHMEM to bale out after only 32 pages are
:    reclaimed.  Thus effectively causing smaller zones to be scanned relatively
:    harder than large ones.
:
:    Now I need to remember what the workload was which caused me to write this
:    patch originally, then fix it up in a different way...
And we haven't demonstrated that whatever problem caused that reversion is
not being reintroduced by this change in 2008.
Signed-off-by: Rik van Riel <riel@redhat.com>
Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
											
										 
											2009-01-06 14:40:01 -08:00
										 |  |  | 	return sc.nr_reclaimed >= nr_pages; | 
					
						
							| 
									
										
										
										
											2006-01-18 17:42:31 -08:00
										 |  |  | } | 
					
						
							| 
									
										
										
										
											2006-03-22 00:08:18 -08:00
										 |  |  | 
 | 
					
						
							|  |  |  | int zone_reclaim(struct zone *zone, gfp_t gfp_mask, unsigned int order) | 
					
						
							|  |  |  | { | 
					
						
							|  |  |  | 	int node_id; | 
					
						
							| 
									
										
										
										
											2007-10-16 23:26:01 -07:00
										 |  |  | 	int ret; | 
					
						
							| 
									
										
										
										
											2006-03-22 00:08:18 -08:00
										 |  |  | 
 | 
					
						
							|  |  |  | 	/*
 | 
					
						
							| 
									
										
										
										
											2006-09-25 23:31:52 -07:00
										 |  |  | 	 * Zone reclaim reclaims unmapped file backed pages and | 
					
						
							|  |  |  | 	 * slab pages if we are over the defined limits. | 
					
						
							| 
									
										
										
										
											2006-06-30 01:55:37 -07:00
										 |  |  | 	 * | 
					
						
							| 
									
										
										
										
											2006-07-03 00:24:13 -07:00
										 |  |  | 	 * A small portion of unmapped file backed pages is needed for | 
					
						
							|  |  |  | 	 * file I/O otherwise pages read by file I/O will be immediately | 
					
						
							|  |  |  | 	 * thrown out if the zone is overallocated. So we do not reclaim | 
					
						
							|  |  |  | 	 * if less than a specified percentage of the zone is used by | 
					
						
							|  |  |  | 	 * unmapped file backed pages. | 
					
						
							| 
									
										
										
										
											2006-03-22 00:08:18 -08:00
										 |  |  | 	 */ | 
					
						
							| 
									
										
											  
											
												vmscan: properly account for the number of page cache pages zone_reclaim() can reclaim
A bug was brought to my attention against a distro kernel but it affects
mainline and I believe problems like this have been reported in various
guises on the mailing lists although I don't have specific examples at the
moment.
The reported problem was that malloc() stalled for a long time (minutes in
some cases) if a large tmpfs mount was occupying a large percentage of
memory overall.  The pages did not get cleaned or reclaimed by
zone_reclaim() because the zone_reclaim_mode was unsuitable, but the lists
are uselessly scanned frequencly making the CPU spin at near 100%.
This patchset intends to address that bug and bring the behaviour of
zone_reclaim() more in line with expectations which were noticed during
investigation.  It is based on top of mmotm and takes advantage of
Kosaki's work with respect to zone_reclaim().
Patch 1 fixes the heuristics that zone_reclaim() uses to determine if the
	scan should go ahead. The broken heuristic is what was causing the
	malloc() stall as it uselessly scanned the LRU constantly. Currently,
	zone_reclaim is assuming zone_reclaim_mode is 1 and historically it
	could not deal with tmpfs pages at all. This fixes up the heuristic so
	that an unnecessary scan is more likely to be correctly avoided.
Patch 2 notes that zone_reclaim() returning a failure automatically means
	the zone is marked full. This is not always true. It could have
	failed because the GFP mask or zone_reclaim_mode were unsuitable.
Patch 3 introduces a counter zreclaim_failed that will increment each
	time the zone_reclaim scan-avoidance heuristics fail. If that
	counter is rapidly increasing, then zone_reclaim_mode should be
	set to 0 as a temporarily resolution and a bug reported because
	the scan-avoidance heuristic is still broken.
This patch:
On NUMA machines, the administrator can configure zone_reclaim_mode that
is a more targetted form of direct reclaim.  On machines with large NUMA
distances for example, a zone_reclaim_mode defaults to 1 meaning that
clean unmapped pages will be reclaimed if the zone watermarks are not
being met.
There is a heuristic that determines if the scan is worthwhile but the
problem is that the heuristic is not being properly applied and is
basically assuming zone_reclaim_mode is 1 if it is enabled.  The lack of
proper detection can manfiest as high CPU usage as the LRU list is scanned
uselessly.
Historically, once enabled it was depending on NR_FILE_PAGES which may
include swapcache pages that the reclaim_mode cannot deal with.  Patch
vmscan-change-the-number-of-the-unmapped-files-in-zone-reclaim.patch by
Kosaki Motohiro noted that zone_page_state(zone, NR_FILE_PAGES) included
pages that were not file-backed such as swapcache and made a calculation
based on the inactive, active and mapped files.  This is far superior when
zone_reclaim==1 but if RECLAIM_SWAP is set, then NR_FILE_PAGES is a
reasonable starting figure.
This patch alters how zone_reclaim() works out how many pages it might be
able to reclaim given the current reclaim_mode.  If RECLAIM_SWAP is set in
the reclaim_mode it will either consider NR_FILE_PAGES as potential
candidates or else use NR_{IN}ACTIVE}_PAGES-NR_FILE_MAPPED to discount
swapcache and other non-file-backed pages.  If RECLAIM_WRITE is not set,
then NR_FILE_DIRTY number of pages are not candidates.  If RECLAIM_SWAP is
not set, then NR_FILE_MAPPED are not.
[kosaki.motohiro@jp.fujitsu.com: Estimate unmapped pages minus tmpfs pages]
[fengguang.wu@intel.com: Fix underflow problem in Kosaki's estimate]
Signed-off-by: Mel Gorman <mel@csn.ul.ie>
Reviewed-by: Rik van Riel <riel@redhat.com>
Acked-by: Christoph Lameter <cl@linux-foundation.org>
Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Cc: Wu Fengguang <fengguang.wu@intel.com>
Cc: <stable@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
											
										 
											2009-06-16 15:33:20 -07:00
										 |  |  | 	if (zone_pagecache_reclaimable(zone) <= zone->min_unmapped_pages && | 
					
						
							|  |  |  | 	    zone_page_state(zone, NR_SLAB_RECLAIMABLE) <= zone->min_slab_pages) | 
					
						
							| 
									
										
										
										
											2009-06-16 15:33:22 -07:00
										 |  |  | 		return ZONE_RECLAIM_FULL; | 
					
						
							| 
									
										
										
										
											2006-03-22 00:08:18 -08:00
										 |  |  | 
 | 
					
						
							| 
									
										
										
										
											2013-09-11 14:22:36 -07:00
										 |  |  | 	if (!zone_reclaimable(zone)) | 
					
						
							| 
									
										
										
										
											2009-06-16 15:33:22 -07:00
										 |  |  | 		return ZONE_RECLAIM_FULL; | 
					
						
							| 
									
										
										
										
											2007-10-16 23:26:01 -07:00
										 |  |  | 
 | 
					
						
							| 
									
										
										
										
											2006-03-22 00:08:18 -08:00
										 |  |  | 	/*
 | 
					
						
							| 
									
										
										
										
											2007-10-16 23:26:01 -07:00
										 |  |  | 	 * Do not scan if the allocation should not be delayed. | 
					
						
							| 
									
										
										
										
											2006-03-22 00:08:18 -08:00
										 |  |  | 	 */ | 
					
						
							| 
									
										
										
										
											2007-10-16 23:26:01 -07:00
										 |  |  | 	if (!(gfp_mask & __GFP_WAIT) || (current->flags & PF_MEMALLOC)) | 
					
						
							| 
									
										
										
										
											2009-06-16 15:33:22 -07:00
										 |  |  | 		return ZONE_RECLAIM_NOSCAN; | 
					
						
							| 
									
										
										
										
											2006-03-22 00:08:18 -08:00
										 |  |  | 
 | 
					
						
							|  |  |  | 	/*
 | 
					
						
							|  |  |  | 	 * Only run zone reclaim on the local zone or on zones that do not | 
					
						
							|  |  |  | 	 * have associated processors. This will favor the local processor | 
					
						
							|  |  |  | 	 * over remote processors and spread off node memory allocations | 
					
						
							|  |  |  | 	 * as wide as possible. | 
					
						
							|  |  |  | 	 */ | 
					
						
							| 
									
										
										
										
											2006-09-25 23:31:55 -07:00
										 |  |  | 	node_id = zone_to_nid(zone); | 
					
						
							| 
									
										
										
										
											2007-10-16 01:25:36 -07:00
										 |  |  | 	if (node_state(node_id, N_CPU) && node_id != numa_node_id()) | 
					
						
							| 
									
										
										
										
											2009-06-16 15:33:22 -07:00
										 |  |  | 		return ZONE_RECLAIM_NOSCAN; | 
					
						
							| 
									
										
										
										
											2007-10-16 23:26:01 -07:00
										 |  |  | 
 | 
					
						
							|  |  |  | 	if (zone_test_and_set_flag(zone, ZONE_RECLAIM_LOCKED)) | 
					
						
							| 
									
										
										
										
											2009-06-16 15:33:22 -07:00
										 |  |  | 		return ZONE_RECLAIM_NOSCAN; | 
					
						
							|  |  |  | 
 | 
					
						
							| 
									
										
										
										
											2007-10-16 23:26:01 -07:00
										 |  |  | 	ret = __zone_reclaim(zone, gfp_mask, order); | 
					
						
							|  |  |  | 	zone_clear_flag(zone, ZONE_RECLAIM_LOCKED); | 
					
						
							|  |  |  | 
 | 
					
						
							| 
									
										
										
										
											2009-06-16 15:33:23 -07:00
										 |  |  | 	if (!ret) | 
					
						
							|  |  |  | 		count_vm_event(PGSCAN_ZONE_RECLAIM_FAILED); | 
					
						
							|  |  |  | 
 | 
					
						
							| 
									
										
										
										
											2007-10-16 23:26:01 -07:00
										 |  |  | 	return ret; | 
					
						
							| 
									
										
										
										
											2006-03-22 00:08:18 -08:00
										 |  |  | } | 
					
						
							| 
									
										
										
										
											2006-01-18 17:42:31 -08:00
										 |  |  | #endif
 | 
					
						
							| 
									
										
											  
											
												Unevictable LRU Infrastructure
When the system contains lots of mlocked or otherwise unevictable pages,
the pageout code (kswapd) can spend lots of time scanning over these
pages.  Worse still, the presence of lots of unevictable pages can confuse
kswapd into thinking that more aggressive pageout modes are required,
resulting in all kinds of bad behaviour.
Infrastructure to manage pages excluded from reclaim--i.e., hidden from
vmscan.  Based on a patch by Larry Woodman of Red Hat.  Reworked to
maintain "unevictable" pages on a separate per-zone LRU list, to "hide"
them from vmscan.
Kosaki Motohiro added the support for the memory controller unevictable
lru list.
Pages on the unevictable list have both PG_unevictable and PG_lru set.
Thus, PG_unevictable is analogous to and mutually exclusive with
PG_active--it specifies which LRU list the page is on.
The unevictable infrastructure is enabled by a new mm Kconfig option
[CONFIG_]UNEVICTABLE_LRU.
A new function 'page_evictable(page, vma)' in vmscan.c tests whether or
not a page may be evictable.  Subsequent patches will add the various
!evictable tests.  We'll want to keep these tests light-weight for use in
shrink_active_list() and, possibly, the fault path.
To avoid races between tasks putting pages [back] onto an LRU list and
tasks that might be moving the page from non-evictable to evictable state,
the new function 'putback_lru_page()' -- inverse to 'isolate_lru_page()'
-- tests the "evictability" of a page after placing it on the LRU, before
dropping the reference.  If the page has become unevictable,
putback_lru_page() will redo the 'putback', thus moving the page to the
unevictable list.  This way, we avoid "stranding" evictable pages on the
unevictable list.
[akpm@linux-foundation.org: fix fallout from out-of-order merge]
[riel@redhat.com: fix UNEVICTABLE_LRU and !PROC_PAGE_MONITOR build]
[nishimura@mxp.nes.nec.co.jp: remove redundant mapping check]
[kosaki.motohiro@jp.fujitsu.com: unevictable-lru-infrastructure: putback_lru_page()/unevictable page handling rework]
[kosaki.motohiro@jp.fujitsu.com: kill unnecessary lock_page() in vmscan.c]
[kosaki.motohiro@jp.fujitsu.com: revert migration change of unevictable lru infrastructure]
[kosaki.motohiro@jp.fujitsu.com: revert to unevictable-lru-infrastructure-kconfig-fix.patch]
[kosaki.motohiro@jp.fujitsu.com: restore patch failure of vmstat-unevictable-and-mlocked-pages-vm-events.patch]
Signed-off-by: Lee Schermerhorn <lee.schermerhorn@hp.com>
Signed-off-by: Rik van Riel <riel@redhat.com>
Signed-off-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Debugged-by: Benjamin Kidwell <benjkidwell@yahoo.com>
Signed-off-by: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp>
Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
											
										 
											2008-10-18 20:26:39 -07:00
										 |  |  | 
 | 
					
						
							|  |  |  | /*
 | 
					
						
							|  |  |  |  * page_evictable - test whether a page is evictable | 
					
						
							|  |  |  |  * @page: the page to test | 
					
						
							|  |  |  |  * | 
					
						
							|  |  |  |  * Test whether page is evictable--i.e., should be placed on active/inactive | 
					
						
							| 
									
										
										
										
											2012-10-08 16:33:18 -07:00
										 |  |  |  * lists vs unevictable list. | 
					
						
							| 
									
										
											  
											
												Unevictable LRU Infrastructure
When the system contains lots of mlocked or otherwise unevictable pages,
the pageout code (kswapd) can spend lots of time scanning over these
pages.  Worse still, the presence of lots of unevictable pages can confuse
kswapd into thinking that more aggressive pageout modes are required,
resulting in all kinds of bad behaviour.
Infrastructure to manage pages excluded from reclaim--i.e., hidden from
vmscan.  Based on a patch by Larry Woodman of Red Hat.  Reworked to
maintain "unevictable" pages on a separate per-zone LRU list, to "hide"
them from vmscan.
Kosaki Motohiro added the support for the memory controller unevictable
lru list.
Pages on the unevictable list have both PG_unevictable and PG_lru set.
Thus, PG_unevictable is analogous to and mutually exclusive with
PG_active--it specifies which LRU list the page is on.
The unevictable infrastructure is enabled by a new mm Kconfig option
[CONFIG_]UNEVICTABLE_LRU.
A new function 'page_evictable(page, vma)' in vmscan.c tests whether or
not a page may be evictable.  Subsequent patches will add the various
!evictable tests.  We'll want to keep these tests light-weight for use in
shrink_active_list() and, possibly, the fault path.
To avoid races between tasks putting pages [back] onto an LRU list and
tasks that might be moving the page from non-evictable to evictable state,
the new function 'putback_lru_page()' -- inverse to 'isolate_lru_page()'
-- tests the "evictability" of a page after placing it on the LRU, before
dropping the reference.  If the page has become unevictable,
putback_lru_page() will redo the 'putback', thus moving the page to the
unevictable list.  This way, we avoid "stranding" evictable pages on the
unevictable list.
[akpm@linux-foundation.org: fix fallout from out-of-order merge]
[riel@redhat.com: fix UNEVICTABLE_LRU and !PROC_PAGE_MONITOR build]
[nishimura@mxp.nes.nec.co.jp: remove redundant mapping check]
[kosaki.motohiro@jp.fujitsu.com: unevictable-lru-infrastructure: putback_lru_page()/unevictable page handling rework]
[kosaki.motohiro@jp.fujitsu.com: kill unnecessary lock_page() in vmscan.c]
[kosaki.motohiro@jp.fujitsu.com: revert migration change of unevictable lru infrastructure]
[kosaki.motohiro@jp.fujitsu.com: revert to unevictable-lru-infrastructure-kconfig-fix.patch]
[kosaki.motohiro@jp.fujitsu.com: restore patch failure of vmstat-unevictable-and-mlocked-pages-vm-events.patch]
Signed-off-by: Lee Schermerhorn <lee.schermerhorn@hp.com>
Signed-off-by: Rik van Riel <riel@redhat.com>
Signed-off-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Debugged-by: Benjamin Kidwell <benjkidwell@yahoo.com>
Signed-off-by: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp>
Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
											
										 
											2008-10-18 20:26:39 -07:00
										 |  |  |  * | 
					
						
							|  |  |  |  * Reasons page might not be evictable: | 
					
						
							| 
									
										
										
										
											2008-10-18 20:26:42 -07:00
										 |  |  |  * (1) page's mapping marked unevictable | 
					
						
							| 
									
										
											  
											
												mlock: mlocked pages are unevictable
Make sure that mlocked pages also live on the unevictable LRU, so kswapd
will not scan them over and over again.
This is achieved through various strategies:
1) add yet another page flag--PG_mlocked--to indicate that
   the page is locked for efficient testing in vmscan and,
   optionally, fault path.  This allows early culling of
   unevictable pages, preventing them from getting to
   page_referenced()/try_to_unmap().  Also allows separate
   accounting of mlock'd pages, as Nick's original patch
   did.
   Note:  Nick's original mlock patch used a PG_mlocked
   flag.  I had removed this in favor of the PG_unevictable
   flag + an mlock_count [new page struct member].  I
   restored the PG_mlocked flag to eliminate the new
   count field.
2) add the mlock/unevictable infrastructure to mm/mlock.c,
   with internal APIs in mm/internal.h.  This is a rework
   of Nick's original patch to these files, taking into
   account that mlocked pages are now kept on unevictable
   LRU list.
3) update vmscan.c:page_evictable() to check PageMlocked()
   and, if vma passed in, the vm_flags.  Note that the vma
   will only be passed in for new pages in the fault path;
   and then only if the "cull unevictable pages in fault
   path" patch is included.
4) add try_to_unlock() to rmap.c to walk a page's rmap and
   ClearPageMlocked() if no other vmas have it mlocked.
   Reuses as much of try_to_unmap() as possible.  This
   effectively replaces the use of one of the lru list links
   as an mlock count.  If this mechanism let's pages in mlocked
   vmas leak through w/o PG_mlocked set [I don't know that it
   does], we should catch them later in try_to_unmap().  One
   hopes this will be rare, as it will be relatively expensive.
Original mm/internal.h, mm/rmap.c and mm/mlock.c changes:
Signed-off-by: Nick Piggin <npiggin@suse.de>
splitlru: introduce __get_user_pages():
  New munlock processing need to GUP_FLAGS_IGNORE_VMA_PERMISSIONS.
  because current get_user_pages() can't grab PROT_NONE pages theresore it
  cause PROT_NONE pages can't munlock.
[akpm@linux-foundation.org: fix this for pagemap-pass-mm-into-pagewalkers.patch]
[akpm@linux-foundation.org: untangle patch interdependencies]
[akpm@linux-foundation.org: fix things after out-of-order merging]
[hugh@veritas.com: fix page-flags mess]
[lee.schermerhorn@hp.com: fix munlock page table walk - now requires 'mm']
[kosaki.motohiro@jp.fujitsu.com: build fix]
[kosaki.motohiro@jp.fujitsu.com: fix truncate race and sevaral comments]
[kosaki.motohiro@jp.fujitsu.com: splitlru: introduce __get_user_pages()]
Signed-off-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Signed-off-by: Rik van Riel <riel@redhat.com>
Signed-off-by: Lee Schermerhorn <lee.schermerhorn@hp.com>
Cc: Nick Piggin <npiggin@suse.de>
Cc: Dave Hansen <dave@linux.vnet.ibm.com>
Cc: Matt Mackall <mpm@selenic.com>
Signed-off-by: Hugh Dickins <hugh@veritas.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
											
										 
											2008-10-18 20:26:44 -07:00
										 |  |  |  * (2) page is part of an mlocked VMA | 
					
						
							| 
									
										
										
										
											2008-10-18 20:26:42 -07:00
										 |  |  |  * | 
					
						
							| 
									
										
											  
											
												Unevictable LRU Infrastructure
When the system contains lots of mlocked or otherwise unevictable pages,
the pageout code (kswapd) can spend lots of time scanning over these
pages.  Worse still, the presence of lots of unevictable pages can confuse
kswapd into thinking that more aggressive pageout modes are required,
resulting in all kinds of bad behaviour.
Infrastructure to manage pages excluded from reclaim--i.e., hidden from
vmscan.  Based on a patch by Larry Woodman of Red Hat.  Reworked to
maintain "unevictable" pages on a separate per-zone LRU list, to "hide"
them from vmscan.
Kosaki Motohiro added the support for the memory controller unevictable
lru list.
Pages on the unevictable list have both PG_unevictable and PG_lru set.
Thus, PG_unevictable is analogous to and mutually exclusive with
PG_active--it specifies which LRU list the page is on.
The unevictable infrastructure is enabled by a new mm Kconfig option
[CONFIG_]UNEVICTABLE_LRU.
A new function 'page_evictable(page, vma)' in vmscan.c tests whether or
not a page may be evictable.  Subsequent patches will add the various
!evictable tests.  We'll want to keep these tests light-weight for use in
shrink_active_list() and, possibly, the fault path.
To avoid races between tasks putting pages [back] onto an LRU list and
tasks that might be moving the page from non-evictable to evictable state,
the new function 'putback_lru_page()' -- inverse to 'isolate_lru_page()'
-- tests the "evictability" of a page after placing it on the LRU, before
dropping the reference.  If the page has become unevictable,
putback_lru_page() will redo the 'putback', thus moving the page to the
unevictable list.  This way, we avoid "stranding" evictable pages on the
unevictable list.
[akpm@linux-foundation.org: fix fallout from out-of-order merge]
[riel@redhat.com: fix UNEVICTABLE_LRU and !PROC_PAGE_MONITOR build]
[nishimura@mxp.nes.nec.co.jp: remove redundant mapping check]
[kosaki.motohiro@jp.fujitsu.com: unevictable-lru-infrastructure: putback_lru_page()/unevictable page handling rework]
[kosaki.motohiro@jp.fujitsu.com: kill unnecessary lock_page() in vmscan.c]
[kosaki.motohiro@jp.fujitsu.com: revert migration change of unevictable lru infrastructure]
[kosaki.motohiro@jp.fujitsu.com: revert to unevictable-lru-infrastructure-kconfig-fix.patch]
[kosaki.motohiro@jp.fujitsu.com: restore patch failure of vmstat-unevictable-and-mlocked-pages-vm-events.patch]
Signed-off-by: Lee Schermerhorn <lee.schermerhorn@hp.com>
Signed-off-by: Rik van Riel <riel@redhat.com>
Signed-off-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Debugged-by: Benjamin Kidwell <benjkidwell@yahoo.com>
Signed-off-by: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp>
Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
											
										 
											2008-10-18 20:26:39 -07:00
										 |  |  |  */ | 
					
						
							| 
									
										
										
										
											2012-10-08 16:33:18 -07:00
										 |  |  | int page_evictable(struct page *page) | 
					
						
							| 
									
										
											  
											
												Unevictable LRU Infrastructure
When the system contains lots of mlocked or otherwise unevictable pages,
the pageout code (kswapd) can spend lots of time scanning over these
pages.  Worse still, the presence of lots of unevictable pages can confuse
kswapd into thinking that more aggressive pageout modes are required,
resulting in all kinds of bad behaviour.
Infrastructure to manage pages excluded from reclaim--i.e., hidden from
vmscan.  Based on a patch by Larry Woodman of Red Hat.  Reworked to
maintain "unevictable" pages on a separate per-zone LRU list, to "hide"
them from vmscan.
Kosaki Motohiro added the support for the memory controller unevictable
lru list.
Pages on the unevictable list have both PG_unevictable and PG_lru set.
Thus, PG_unevictable is analogous to and mutually exclusive with
PG_active--it specifies which LRU list the page is on.
The unevictable infrastructure is enabled by a new mm Kconfig option
[CONFIG_]UNEVICTABLE_LRU.
A new function 'page_evictable(page, vma)' in vmscan.c tests whether or
not a page may be evictable.  Subsequent patches will add the various
!evictable tests.  We'll want to keep these tests light-weight for use in
shrink_active_list() and, possibly, the fault path.
To avoid races between tasks putting pages [back] onto an LRU list and
tasks that might be moving the page from non-evictable to evictable state,
the new function 'putback_lru_page()' -- inverse to 'isolate_lru_page()'
-- tests the "evictability" of a page after placing it on the LRU, before
dropping the reference.  If the page has become unevictable,
putback_lru_page() will redo the 'putback', thus moving the page to the
unevictable list.  This way, we avoid "stranding" evictable pages on the
unevictable list.
[akpm@linux-foundation.org: fix fallout from out-of-order merge]
[riel@redhat.com: fix UNEVICTABLE_LRU and !PROC_PAGE_MONITOR build]
[nishimura@mxp.nes.nec.co.jp: remove redundant mapping check]
[kosaki.motohiro@jp.fujitsu.com: unevictable-lru-infrastructure: putback_lru_page()/unevictable page handling rework]
[kosaki.motohiro@jp.fujitsu.com: kill unnecessary lock_page() in vmscan.c]
[kosaki.motohiro@jp.fujitsu.com: revert migration change of unevictable lru infrastructure]
[kosaki.motohiro@jp.fujitsu.com: revert to unevictable-lru-infrastructure-kconfig-fix.patch]
[kosaki.motohiro@jp.fujitsu.com: restore patch failure of vmstat-unevictable-and-mlocked-pages-vm-events.patch]
Signed-off-by: Lee Schermerhorn <lee.schermerhorn@hp.com>
Signed-off-by: Rik van Riel <riel@redhat.com>
Signed-off-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Debugged-by: Benjamin Kidwell <benjkidwell@yahoo.com>
Signed-off-by: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp>
Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
											
										 
											2008-10-18 20:26:39 -07:00
										 |  |  | { | 
					
						
							| 
									
										
										
										
											2012-10-08 16:33:18 -07:00
										 |  |  | 	return !mapping_unevictable(page_mapping(page)) && !PageMlocked(page); | 
					
						
							| 
									
										
											  
											
												Unevictable LRU Infrastructure
When the system contains lots of mlocked or otherwise unevictable pages,
the pageout code (kswapd) can spend lots of time scanning over these
pages.  Worse still, the presence of lots of unevictable pages can confuse
kswapd into thinking that more aggressive pageout modes are required,
resulting in all kinds of bad behaviour.
Infrastructure to manage pages excluded from reclaim--i.e., hidden from
vmscan.  Based on a patch by Larry Woodman of Red Hat.  Reworked to
maintain "unevictable" pages on a separate per-zone LRU list, to "hide"
them from vmscan.
Kosaki Motohiro added the support for the memory controller unevictable
lru list.
Pages on the unevictable list have both PG_unevictable and PG_lru set.
Thus, PG_unevictable is analogous to and mutually exclusive with
PG_active--it specifies which LRU list the page is on.
The unevictable infrastructure is enabled by a new mm Kconfig option
[CONFIG_]UNEVICTABLE_LRU.
A new function 'page_evictable(page, vma)' in vmscan.c tests whether or
not a page may be evictable.  Subsequent patches will add the various
!evictable tests.  We'll want to keep these tests light-weight for use in
shrink_active_list() and, possibly, the fault path.
To avoid races between tasks putting pages [back] onto an LRU list and
tasks that might be moving the page from non-evictable to evictable state,
the new function 'putback_lru_page()' -- inverse to 'isolate_lru_page()'
-- tests the "evictability" of a page after placing it on the LRU, before
dropping the reference.  If the page has become unevictable,
putback_lru_page() will redo the 'putback', thus moving the page to the
unevictable list.  This way, we avoid "stranding" evictable pages on the
unevictable list.
[akpm@linux-foundation.org: fix fallout from out-of-order merge]
[riel@redhat.com: fix UNEVICTABLE_LRU and !PROC_PAGE_MONITOR build]
[nishimura@mxp.nes.nec.co.jp: remove redundant mapping check]
[kosaki.motohiro@jp.fujitsu.com: unevictable-lru-infrastructure: putback_lru_page()/unevictable page handling rework]
[kosaki.motohiro@jp.fujitsu.com: kill unnecessary lock_page() in vmscan.c]
[kosaki.motohiro@jp.fujitsu.com: revert migration change of unevictable lru infrastructure]
[kosaki.motohiro@jp.fujitsu.com: revert to unevictable-lru-infrastructure-kconfig-fix.patch]
[kosaki.motohiro@jp.fujitsu.com: restore patch failure of vmstat-unevictable-and-mlocked-pages-vm-events.patch]
Signed-off-by: Lee Schermerhorn <lee.schermerhorn@hp.com>
Signed-off-by: Rik van Riel <riel@redhat.com>
Signed-off-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Debugged-by: Benjamin Kidwell <benjkidwell@yahoo.com>
Signed-off-by: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp>
Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
											
										 
											2008-10-18 20:26:39 -07:00
										 |  |  | } | 
					
						
							| 
									
										
										
										
											2008-10-18 20:26:43 -07:00
										 |  |  | 
 | 
					
						
							| 
									
										
											  
											
												SHM_UNLOCK: fix long unpreemptible section
scan_mapping_unevictable_pages() is used to make SysV SHM_LOCKed pages
evictable again once the shared memory is unlocked.  It does this with
pagevec_lookup()s across the whole object (which might occupy most of
memory), and takes 300ms to unlock 7GB here.  A cond_resched() every
PAGEVEC_SIZE pages would be good.
However, KOSAKI-san points out that this is called under shmem.c's
info->lock, and it's also under shm.c's shm_lock(), both spinlocks.
There is no strong reason for that: we need to take these pages off the
unevictable list soonish, but those locks are not required for it.
So move the call to scan_mapping_unevictable_pages() from shmem.c's
unlock handling up to shm.c's unlock handling.  Remove the recently
added barrier, not needed now we have spin_unlock() before the scan.
Use get_file(), with subsequent fput(), to make sure we have a reference
to mapping throughout scan_mapping_unevictable_pages(): that's something
that was previously guaranteed by the shm_lock().
Remove shmctl's lru_add_drain_all(): we don't fault in pages at SHM_LOCK
time, and we lazily discover them to be Unevictable later, so it serves
no purpose for SHM_LOCK; and serves no purpose for SHM_UNLOCK, since
pages still on pagevec are not marked Unevictable.
The original code avoided redundant rescans by checking VM_LOCKED flag
at its level: now avoid them by checking shp's SHM_LOCKED.
The original code called scan_mapping_unevictable_pages() on a locked
area at shm_destroy() time: perhaps we once had accounting cross-checks
which required that, but not now, so skip the overhead and just let
inode eviction deal with them.
Put check_move_unevictable_page() and scan_mapping_unevictable_pages()
under CONFIG_SHMEM (with stub for the TINY case when ramfs is used),
more as comment than to save space; comment them used for SHM_UNLOCK.
Signed-off-by: Hugh Dickins <hughd@google.com>
Reviewed-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Cc: Minchan Kim <minchan.kim@gmail.com>
Cc: Rik van Riel <riel@redhat.com>
Cc: Shaohua Li <shaohua.li@intel.com>
Cc: Eric Dumazet <eric.dumazet@gmail.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Michel Lespinasse <walken@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
											
										 
											2012-01-20 14:34:19 -08:00
										 |  |  | #ifdef CONFIG_SHMEM
 | 
					
						
							| 
									
										
										
										
											2008-10-18 20:26:43 -07:00
										 |  |  | /**
 | 
					
						
							| 
									
										
											  
											
												SHM_UNLOCK: fix Unevictable pages stranded after swap
Commit cc39c6a9bbde ("mm: account skipped entries to avoid looping in
find_get_pages") correctly fixed an infinite loop; but left a problem
that find_get_pages() on shmem would return 0 (appearing to callers to
mean end of tree) when it meets a run of nr_pages swap entries.
The only uses of find_get_pages() on shmem are via pagevec_lookup(),
called from invalidate_mapping_pages(), and from shmctl SHM_UNLOCK's
scan_mapping_unevictable_pages().  The first is already commented, and
not worth worrying about; but the second can leave pages on the
Unevictable list after an unusual sequence of swapping and locking.
Fix that by using shmem_find_get_pages_and_swap() (then ignoring the
swap) instead of pagevec_lookup().
But I don't want to contaminate vmscan.c with shmem internals, nor
shmem.c with LRU locking.  So move scan_mapping_unevictable_pages() into
shmem.c, renaming it shmem_unlock_mapping(); and rename
check_move_unevictable_page() to check_move_unevictable_pages(), looping
down an array of pages, oftentimes under the same lock.
Leave out the "rotate unevictable list" block: that's a leftover from
when this was used for /proc/sys/vm/scan_unevictable_pages, whose flawed
handling involved looking at pages at tail of LRU.
Was there significance to the sequence first ClearPageUnevictable, then
test page_evictable, then SetPageUnevictable here? I think not, we're
under LRU lock, and have no barriers between those.
Signed-off-by: Hugh Dickins <hughd@google.com>
Reviewed-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Cc: Minchan Kim <minchan.kim@gmail.com>
Cc: Rik van Riel <riel@redhat.com>
Cc: Shaohua Li <shaohua.li@intel.com>
Cc: Eric Dumazet <eric.dumazet@gmail.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Michel Lespinasse <walken@google.com>
Cc: <stable@vger.kernel.org> [back to 3.1 but will need respins]
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
											
										 
											2012-01-20 14:34:21 -08:00
										 |  |  |  * check_move_unevictable_pages - check pages for evictability and move to appropriate zone lru list | 
					
						
							|  |  |  |  * @pages:	array of pages to check | 
					
						
							|  |  |  |  * @nr_pages:	number of pages to check | 
					
						
							| 
									
										
										
										
											2008-10-18 20:26:43 -07:00
										 |  |  |  * | 
					
						
							| 
									
										
											  
											
												SHM_UNLOCK: fix Unevictable pages stranded after swap
Commit cc39c6a9bbde ("mm: account skipped entries to avoid looping in
find_get_pages") correctly fixed an infinite loop; but left a problem
that find_get_pages() on shmem would return 0 (appearing to callers to
mean end of tree) when it meets a run of nr_pages swap entries.
The only uses of find_get_pages() on shmem are via pagevec_lookup(),
called from invalidate_mapping_pages(), and from shmctl SHM_UNLOCK's
scan_mapping_unevictable_pages().  The first is already commented, and
not worth worrying about; but the second can leave pages on the
Unevictable list after an unusual sequence of swapping and locking.
Fix that by using shmem_find_get_pages_and_swap() (then ignoring the
swap) instead of pagevec_lookup().
But I don't want to contaminate vmscan.c with shmem internals, nor
shmem.c with LRU locking.  So move scan_mapping_unevictable_pages() into
shmem.c, renaming it shmem_unlock_mapping(); and rename
check_move_unevictable_page() to check_move_unevictable_pages(), looping
down an array of pages, oftentimes under the same lock.
Leave out the "rotate unevictable list" block: that's a leftover from
when this was used for /proc/sys/vm/scan_unevictable_pages, whose flawed
handling involved looking at pages at tail of LRU.
Was there significance to the sequence first ClearPageUnevictable, then
test page_evictable, then SetPageUnevictable here? I think not, we're
under LRU lock, and have no barriers between those.
Signed-off-by: Hugh Dickins <hughd@google.com>
Reviewed-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Cc: Minchan Kim <minchan.kim@gmail.com>
Cc: Rik van Riel <riel@redhat.com>
Cc: Shaohua Li <shaohua.li@intel.com>
Cc: Eric Dumazet <eric.dumazet@gmail.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Michel Lespinasse <walken@google.com>
Cc: <stable@vger.kernel.org> [back to 3.1 but will need respins]
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
											
										 
											2012-01-20 14:34:21 -08:00
										 |  |  |  * Checks pages for evictability and moves them to the appropriate lru list. | 
					
						
							| 
									
										
											  
											
												SHM_UNLOCK: fix long unpreemptible section
scan_mapping_unevictable_pages() is used to make SysV SHM_LOCKed pages
evictable again once the shared memory is unlocked.  It does this with
pagevec_lookup()s across the whole object (which might occupy most of
memory), and takes 300ms to unlock 7GB here.  A cond_resched() every
PAGEVEC_SIZE pages would be good.
However, KOSAKI-san points out that this is called under shmem.c's
info->lock, and it's also under shm.c's shm_lock(), both spinlocks.
There is no strong reason for that: we need to take these pages off the
unevictable list soonish, but those locks are not required for it.
So move the call to scan_mapping_unevictable_pages() from shmem.c's
unlock handling up to shm.c's unlock handling.  Remove the recently
added barrier, not needed now we have spin_unlock() before the scan.
Use get_file(), with subsequent fput(), to make sure we have a reference
to mapping throughout scan_mapping_unevictable_pages(): that's something
that was previously guaranteed by the shm_lock().
Remove shmctl's lru_add_drain_all(): we don't fault in pages at SHM_LOCK
time, and we lazily discover them to be Unevictable later, so it serves
no purpose for SHM_LOCK; and serves no purpose for SHM_UNLOCK, since
pages still on pagevec are not marked Unevictable.
The original code avoided redundant rescans by checking VM_LOCKED flag
at its level: now avoid them by checking shp's SHM_LOCKED.
The original code called scan_mapping_unevictable_pages() on a locked
area at shm_destroy() time: perhaps we once had accounting cross-checks
which required that, but not now, so skip the overhead and just let
inode eviction deal with them.
Put check_move_unevictable_page() and scan_mapping_unevictable_pages()
under CONFIG_SHMEM (with stub for the TINY case when ramfs is used),
more as comment than to save space; comment them used for SHM_UNLOCK.
Signed-off-by: Hugh Dickins <hughd@google.com>
Reviewed-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Cc: Minchan Kim <minchan.kim@gmail.com>
Cc: Rik van Riel <riel@redhat.com>
Cc: Shaohua Li <shaohua.li@intel.com>
Cc: Eric Dumazet <eric.dumazet@gmail.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Michel Lespinasse <walken@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
											
										 
											2012-01-20 14:34:19 -08:00
										 |  |  |  * | 
					
						
							|  |  |  |  * This function is only used for SysV IPC SHM_UNLOCK. | 
					
						
							| 
									
										
										
										
											2008-10-18 20:26:43 -07:00
										 |  |  |  */ | 
					
						
							| 
									
										
											  
											
												SHM_UNLOCK: fix Unevictable pages stranded after swap
Commit cc39c6a9bbde ("mm: account skipped entries to avoid looping in
find_get_pages") correctly fixed an infinite loop; but left a problem
that find_get_pages() on shmem would return 0 (appearing to callers to
mean end of tree) when it meets a run of nr_pages swap entries.
The only uses of find_get_pages() on shmem are via pagevec_lookup(),
called from invalidate_mapping_pages(), and from shmctl SHM_UNLOCK's
scan_mapping_unevictable_pages().  The first is already commented, and
not worth worrying about; but the second can leave pages on the
Unevictable list after an unusual sequence of swapping and locking.
Fix that by using shmem_find_get_pages_and_swap() (then ignoring the
swap) instead of pagevec_lookup().
But I don't want to contaminate vmscan.c with shmem internals, nor
shmem.c with LRU locking.  So move scan_mapping_unevictable_pages() into
shmem.c, renaming it shmem_unlock_mapping(); and rename
check_move_unevictable_page() to check_move_unevictable_pages(), looping
down an array of pages, oftentimes under the same lock.
Leave out the "rotate unevictable list" block: that's a leftover from
when this was used for /proc/sys/vm/scan_unevictable_pages, whose flawed
handling involved looking at pages at tail of LRU.
Was there significance to the sequence first ClearPageUnevictable, then
test page_evictable, then SetPageUnevictable here? I think not, we're
under LRU lock, and have no barriers between those.
Signed-off-by: Hugh Dickins <hughd@google.com>
Reviewed-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Cc: Minchan Kim <minchan.kim@gmail.com>
Cc: Rik van Riel <riel@redhat.com>
Cc: Shaohua Li <shaohua.li@intel.com>
Cc: Eric Dumazet <eric.dumazet@gmail.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Michel Lespinasse <walken@google.com>
Cc: <stable@vger.kernel.org> [back to 3.1 but will need respins]
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
											
										 
											2012-01-20 14:34:21 -08:00
										 |  |  | void check_move_unevictable_pages(struct page **pages, int nr_pages) | 
					
						
							| 
									
										
										
										
											2008-10-18 20:26:43 -07:00
										 |  |  | { | 
					
						
							| 
									
										
										
										
											2012-01-12 17:18:15 -08:00
										 |  |  | 	struct lruvec *lruvec; | 
					
						
							| 
									
										
											  
											
												SHM_UNLOCK: fix Unevictable pages stranded after swap
Commit cc39c6a9bbde ("mm: account skipped entries to avoid looping in
find_get_pages") correctly fixed an infinite loop; but left a problem
that find_get_pages() on shmem would return 0 (appearing to callers to
mean end of tree) when it meets a run of nr_pages swap entries.
The only uses of find_get_pages() on shmem are via pagevec_lookup(),
called from invalidate_mapping_pages(), and from shmctl SHM_UNLOCK's
scan_mapping_unevictable_pages().  The first is already commented, and
not worth worrying about; but the second can leave pages on the
Unevictable list after an unusual sequence of swapping and locking.
Fix that by using shmem_find_get_pages_and_swap() (then ignoring the
swap) instead of pagevec_lookup().
But I don't want to contaminate vmscan.c with shmem internals, nor
shmem.c with LRU locking.  So move scan_mapping_unevictable_pages() into
shmem.c, renaming it shmem_unlock_mapping(); and rename
check_move_unevictable_page() to check_move_unevictable_pages(), looping
down an array of pages, oftentimes under the same lock.
Leave out the "rotate unevictable list" block: that's a leftover from
when this was used for /proc/sys/vm/scan_unevictable_pages, whose flawed
handling involved looking at pages at tail of LRU.
Was there significance to the sequence first ClearPageUnevictable, then
test page_evictable, then SetPageUnevictable here? I think not, we're
under LRU lock, and have no barriers between those.
Signed-off-by: Hugh Dickins <hughd@google.com>
Reviewed-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Cc: Minchan Kim <minchan.kim@gmail.com>
Cc: Rik van Riel <riel@redhat.com>
Cc: Shaohua Li <shaohua.li@intel.com>
Cc: Eric Dumazet <eric.dumazet@gmail.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Michel Lespinasse <walken@google.com>
Cc: <stable@vger.kernel.org> [back to 3.1 but will need respins]
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
											
										 
											2012-01-20 14:34:21 -08:00
										 |  |  | 	struct zone *zone = NULL; | 
					
						
							|  |  |  | 	int pgscanned = 0; | 
					
						
							|  |  |  | 	int pgrescued = 0; | 
					
						
							|  |  |  | 	int i; | 
					
						
							| 
									
										
										
										
											2008-10-18 20:26:43 -07:00
										 |  |  | 
 | 
					
						
							| 
									
										
											  
											
												SHM_UNLOCK: fix Unevictable pages stranded after swap
Commit cc39c6a9bbde ("mm: account skipped entries to avoid looping in
find_get_pages") correctly fixed an infinite loop; but left a problem
that find_get_pages() on shmem would return 0 (appearing to callers to
mean end of tree) when it meets a run of nr_pages swap entries.
The only uses of find_get_pages() on shmem are via pagevec_lookup(),
called from invalidate_mapping_pages(), and from shmctl SHM_UNLOCK's
scan_mapping_unevictable_pages().  The first is already commented, and
not worth worrying about; but the second can leave pages on the
Unevictable list after an unusual sequence of swapping and locking.
Fix that by using shmem_find_get_pages_and_swap() (then ignoring the
swap) instead of pagevec_lookup().
But I don't want to contaminate vmscan.c with shmem internals, nor
shmem.c with LRU locking.  So move scan_mapping_unevictable_pages() into
shmem.c, renaming it shmem_unlock_mapping(); and rename
check_move_unevictable_page() to check_move_unevictable_pages(), looping
down an array of pages, oftentimes under the same lock.
Leave out the "rotate unevictable list" block: that's a leftover from
when this was used for /proc/sys/vm/scan_unevictable_pages, whose flawed
handling involved looking at pages at tail of LRU.
Was there significance to the sequence first ClearPageUnevictable, then
test page_evictable, then SetPageUnevictable here? I think not, we're
under LRU lock, and have no barriers between those.
Signed-off-by: Hugh Dickins <hughd@google.com>
Reviewed-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Cc: Minchan Kim <minchan.kim@gmail.com>
Cc: Rik van Riel <riel@redhat.com>
Cc: Shaohua Li <shaohua.li@intel.com>
Cc: Eric Dumazet <eric.dumazet@gmail.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Michel Lespinasse <walken@google.com>
Cc: <stable@vger.kernel.org> [back to 3.1 but will need respins]
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
											
										 
											2012-01-20 14:34:21 -08:00
										 |  |  | 	for (i = 0; i < nr_pages; i++) { | 
					
						
							|  |  |  | 		struct page *page = pages[i]; | 
					
						
							|  |  |  | 		struct zone *pagezone; | 
					
						
							| 
									
										
										
										
											2008-10-18 20:26:43 -07:00
										 |  |  | 
 | 
					
						
							| 
									
										
											  
											
												SHM_UNLOCK: fix Unevictable pages stranded after swap
Commit cc39c6a9bbde ("mm: account skipped entries to avoid looping in
find_get_pages") correctly fixed an infinite loop; but left a problem
that find_get_pages() on shmem would return 0 (appearing to callers to
mean end of tree) when it meets a run of nr_pages swap entries.
The only uses of find_get_pages() on shmem are via pagevec_lookup(),
called from invalidate_mapping_pages(), and from shmctl SHM_UNLOCK's
scan_mapping_unevictable_pages().  The first is already commented, and
not worth worrying about; but the second can leave pages on the
Unevictable list after an unusual sequence of swapping and locking.
Fix that by using shmem_find_get_pages_and_swap() (then ignoring the
swap) instead of pagevec_lookup().
But I don't want to contaminate vmscan.c with shmem internals, nor
shmem.c with LRU locking.  So move scan_mapping_unevictable_pages() into
shmem.c, renaming it shmem_unlock_mapping(); and rename
check_move_unevictable_page() to check_move_unevictable_pages(), looping
down an array of pages, oftentimes under the same lock.
Leave out the "rotate unevictable list" block: that's a leftover from
when this was used for /proc/sys/vm/scan_unevictable_pages, whose flawed
handling involved looking at pages at tail of LRU.
Was there significance to the sequence first ClearPageUnevictable, then
test page_evictable, then SetPageUnevictable here? I think not, we're
under LRU lock, and have no barriers between those.
Signed-off-by: Hugh Dickins <hughd@google.com>
Reviewed-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Cc: Minchan Kim <minchan.kim@gmail.com>
Cc: Rik van Riel <riel@redhat.com>
Cc: Shaohua Li <shaohua.li@intel.com>
Cc: Eric Dumazet <eric.dumazet@gmail.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Michel Lespinasse <walken@google.com>
Cc: <stable@vger.kernel.org> [back to 3.1 but will need respins]
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
											
										 
											2012-01-20 14:34:21 -08:00
										 |  |  | 		pgscanned++; | 
					
						
							|  |  |  | 		pagezone = page_zone(page); | 
					
						
							|  |  |  | 		if (pagezone != zone) { | 
					
						
							|  |  |  | 			if (zone) | 
					
						
							|  |  |  | 				spin_unlock_irq(&zone->lru_lock); | 
					
						
							|  |  |  | 			zone = pagezone; | 
					
						
							|  |  |  | 			spin_lock_irq(&zone->lru_lock); | 
					
						
							|  |  |  | 		} | 
					
						
							| 
									
										
										
										
											2012-05-29 15:07:09 -07:00
										 |  |  | 		lruvec = mem_cgroup_page_lruvec(page, zone); | 
					
						
							| 
									
										
										
										
											2008-10-18 20:26:43 -07:00
										 |  |  | 
 | 
					
						
							| 
									
										
											  
											
												SHM_UNLOCK: fix Unevictable pages stranded after swap
Commit cc39c6a9bbde ("mm: account skipped entries to avoid looping in
find_get_pages") correctly fixed an infinite loop; but left a problem
that find_get_pages() on shmem would return 0 (appearing to callers to
mean end of tree) when it meets a run of nr_pages swap entries.
The only uses of find_get_pages() on shmem are via pagevec_lookup(),
called from invalidate_mapping_pages(), and from shmctl SHM_UNLOCK's
scan_mapping_unevictable_pages().  The first is already commented, and
not worth worrying about; but the second can leave pages on the
Unevictable list after an unusual sequence of swapping and locking.
Fix that by using shmem_find_get_pages_and_swap() (then ignoring the
swap) instead of pagevec_lookup().
But I don't want to contaminate vmscan.c with shmem internals, nor
shmem.c with LRU locking.  So move scan_mapping_unevictable_pages() into
shmem.c, renaming it shmem_unlock_mapping(); and rename
check_move_unevictable_page() to check_move_unevictable_pages(), looping
down an array of pages, oftentimes under the same lock.
Leave out the "rotate unevictable list" block: that's a leftover from
when this was used for /proc/sys/vm/scan_unevictable_pages, whose flawed
handling involved looking at pages at tail of LRU.
Was there significance to the sequence first ClearPageUnevictable, then
test page_evictable, then SetPageUnevictable here? I think not, we're
under LRU lock, and have no barriers between those.
Signed-off-by: Hugh Dickins <hughd@google.com>
Reviewed-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Cc: Minchan Kim <minchan.kim@gmail.com>
Cc: Rik van Riel <riel@redhat.com>
Cc: Shaohua Li <shaohua.li@intel.com>
Cc: Eric Dumazet <eric.dumazet@gmail.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Michel Lespinasse <walken@google.com>
Cc: <stable@vger.kernel.org> [back to 3.1 but will need respins]
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
											
										 
											2012-01-20 14:34:21 -08:00
										 |  |  | 		if (!PageLRU(page) || !PageUnevictable(page)) | 
					
						
							|  |  |  | 			continue; | 
					
						
							| 
									
										
										
										
											2008-10-18 20:26:43 -07:00
										 |  |  | 
 | 
					
						
							| 
									
										
										
										
											2012-10-08 16:33:18 -07:00
										 |  |  | 		if (page_evictable(page)) { | 
					
						
							| 
									
										
											  
											
												SHM_UNLOCK: fix Unevictable pages stranded after swap
Commit cc39c6a9bbde ("mm: account skipped entries to avoid looping in
find_get_pages") correctly fixed an infinite loop; but left a problem
that find_get_pages() on shmem would return 0 (appearing to callers to
mean end of tree) when it meets a run of nr_pages swap entries.
The only uses of find_get_pages() on shmem are via pagevec_lookup(),
called from invalidate_mapping_pages(), and from shmctl SHM_UNLOCK's
scan_mapping_unevictable_pages().  The first is already commented, and
not worth worrying about; but the second can leave pages on the
Unevictable list after an unusual sequence of swapping and locking.
Fix that by using shmem_find_get_pages_and_swap() (then ignoring the
swap) instead of pagevec_lookup().
But I don't want to contaminate vmscan.c with shmem internals, nor
shmem.c with LRU locking.  So move scan_mapping_unevictable_pages() into
shmem.c, renaming it shmem_unlock_mapping(); and rename
check_move_unevictable_page() to check_move_unevictable_pages(), looping
down an array of pages, oftentimes under the same lock.
Leave out the "rotate unevictable list" block: that's a leftover from
when this was used for /proc/sys/vm/scan_unevictable_pages, whose flawed
handling involved looking at pages at tail of LRU.
Was there significance to the sequence first ClearPageUnevictable, then
test page_evictable, then SetPageUnevictable here? I think not, we're
under LRU lock, and have no barriers between those.
Signed-off-by: Hugh Dickins <hughd@google.com>
Reviewed-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Cc: Minchan Kim <minchan.kim@gmail.com>
Cc: Rik van Riel <riel@redhat.com>
Cc: Shaohua Li <shaohua.li@intel.com>
Cc: Eric Dumazet <eric.dumazet@gmail.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Michel Lespinasse <walken@google.com>
Cc: <stable@vger.kernel.org> [back to 3.1 but will need respins]
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
											
										 
											2012-01-20 14:34:21 -08:00
										 |  |  | 			enum lru_list lru = page_lru_base_type(page); | 
					
						
							|  |  |  | 
 | 
					
						
							|  |  |  | 			VM_BUG_ON(PageActive(page)); | 
					
						
							|  |  |  | 			ClearPageUnevictable(page); | 
					
						
							| 
									
										
										
										
											2012-05-29 15:07:09 -07:00
										 |  |  | 			del_page_from_lru_list(page, lruvec, LRU_UNEVICTABLE); | 
					
						
							|  |  |  | 			add_page_to_lru_list(page, lruvec, lru); | 
					
						
							| 
									
										
											  
											
												SHM_UNLOCK: fix Unevictable pages stranded after swap
Commit cc39c6a9bbde ("mm: account skipped entries to avoid looping in
find_get_pages") correctly fixed an infinite loop; but left a problem
that find_get_pages() on shmem would return 0 (appearing to callers to
mean end of tree) when it meets a run of nr_pages swap entries.
The only uses of find_get_pages() on shmem are via pagevec_lookup(),
called from invalidate_mapping_pages(), and from shmctl SHM_UNLOCK's
scan_mapping_unevictable_pages().  The first is already commented, and
not worth worrying about; but the second can leave pages on the
Unevictable list after an unusual sequence of swapping and locking.
Fix that by using shmem_find_get_pages_and_swap() (then ignoring the
swap) instead of pagevec_lookup().
But I don't want to contaminate vmscan.c with shmem internals, nor
shmem.c with LRU locking.  So move scan_mapping_unevictable_pages() into
shmem.c, renaming it shmem_unlock_mapping(); and rename
check_move_unevictable_page() to check_move_unevictable_pages(), looping
down an array of pages, oftentimes under the same lock.
Leave out the "rotate unevictable list" block: that's a leftover from
when this was used for /proc/sys/vm/scan_unevictable_pages, whose flawed
handling involved looking at pages at tail of LRU.
Was there significance to the sequence first ClearPageUnevictable, then
test page_evictable, then SetPageUnevictable here? I think not, we're
under LRU lock, and have no barriers between those.
Signed-off-by: Hugh Dickins <hughd@google.com>
Reviewed-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Cc: Minchan Kim <minchan.kim@gmail.com>
Cc: Rik van Riel <riel@redhat.com>
Cc: Shaohua Li <shaohua.li@intel.com>
Cc: Eric Dumazet <eric.dumazet@gmail.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Michel Lespinasse <walken@google.com>
Cc: <stable@vger.kernel.org> [back to 3.1 but will need respins]
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
											
										 
											2012-01-20 14:34:21 -08:00
										 |  |  | 			pgrescued++; | 
					
						
							| 
									
										
										
										
											2008-10-18 20:26:43 -07:00
										 |  |  | 		} | 
					
						
							| 
									
										
											  
											
												SHM_UNLOCK: fix Unevictable pages stranded after swap
Commit cc39c6a9bbde ("mm: account skipped entries to avoid looping in
find_get_pages") correctly fixed an infinite loop; but left a problem
that find_get_pages() on shmem would return 0 (appearing to callers to
mean end of tree) when it meets a run of nr_pages swap entries.
The only uses of find_get_pages() on shmem are via pagevec_lookup(),
called from invalidate_mapping_pages(), and from shmctl SHM_UNLOCK's
scan_mapping_unevictable_pages().  The first is already commented, and
not worth worrying about; but the second can leave pages on the
Unevictable list after an unusual sequence of swapping and locking.
Fix that by using shmem_find_get_pages_and_swap() (then ignoring the
swap) instead of pagevec_lookup().
But I don't want to contaminate vmscan.c with shmem internals, nor
shmem.c with LRU locking.  So move scan_mapping_unevictable_pages() into
shmem.c, renaming it shmem_unlock_mapping(); and rename
check_move_unevictable_page() to check_move_unevictable_pages(), looping
down an array of pages, oftentimes under the same lock.
Leave out the "rotate unevictable list" block: that's a leftover from
when this was used for /proc/sys/vm/scan_unevictable_pages, whose flawed
handling involved looking at pages at tail of LRU.
Was there significance to the sequence first ClearPageUnevictable, then
test page_evictable, then SetPageUnevictable here? I think not, we're
under LRU lock, and have no barriers between those.
Signed-off-by: Hugh Dickins <hughd@google.com>
Reviewed-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Cc: Minchan Kim <minchan.kim@gmail.com>
Cc: Rik van Riel <riel@redhat.com>
Cc: Shaohua Li <shaohua.li@intel.com>
Cc: Eric Dumazet <eric.dumazet@gmail.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Michel Lespinasse <walken@google.com>
Cc: <stable@vger.kernel.org> [back to 3.1 but will need respins]
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
											
										 
											2012-01-20 14:34:21 -08:00
										 |  |  | 	} | 
					
						
							| 
									
										
										
										
											2008-10-18 20:26:43 -07:00
										 |  |  | 
 | 
					
						
							| 
									
										
											  
											
												SHM_UNLOCK: fix Unevictable pages stranded after swap
Commit cc39c6a9bbde ("mm: account skipped entries to avoid looping in
find_get_pages") correctly fixed an infinite loop; but left a problem
that find_get_pages() on shmem would return 0 (appearing to callers to
mean end of tree) when it meets a run of nr_pages swap entries.
The only uses of find_get_pages() on shmem are via pagevec_lookup(),
called from invalidate_mapping_pages(), and from shmctl SHM_UNLOCK's
scan_mapping_unevictable_pages().  The first is already commented, and
not worth worrying about; but the second can leave pages on the
Unevictable list after an unusual sequence of swapping and locking.
Fix that by using shmem_find_get_pages_and_swap() (then ignoring the
swap) instead of pagevec_lookup().
But I don't want to contaminate vmscan.c with shmem internals, nor
shmem.c with LRU locking.  So move scan_mapping_unevictable_pages() into
shmem.c, renaming it shmem_unlock_mapping(); and rename
check_move_unevictable_page() to check_move_unevictable_pages(), looping
down an array of pages, oftentimes under the same lock.
Leave out the "rotate unevictable list" block: that's a leftover from
when this was used for /proc/sys/vm/scan_unevictable_pages, whose flawed
handling involved looking at pages at tail of LRU.
Was there significance to the sequence first ClearPageUnevictable, then
test page_evictable, then SetPageUnevictable here? I think not, we're
under LRU lock, and have no barriers between those.
Signed-off-by: Hugh Dickins <hughd@google.com>
Reviewed-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Cc: Minchan Kim <minchan.kim@gmail.com>
Cc: Rik van Riel <riel@redhat.com>
Cc: Shaohua Li <shaohua.li@intel.com>
Cc: Eric Dumazet <eric.dumazet@gmail.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Michel Lespinasse <walken@google.com>
Cc: <stable@vger.kernel.org> [back to 3.1 but will need respins]
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
											
										 
											2012-01-20 14:34:21 -08:00
										 |  |  | 	if (zone) { | 
					
						
							|  |  |  | 		__count_vm_events(UNEVICTABLE_PGRESCUED, pgrescued); | 
					
						
							|  |  |  | 		__count_vm_events(UNEVICTABLE_PGSCANNED, pgscanned); | 
					
						
							|  |  |  | 		spin_unlock_irq(&zone->lru_lock); | 
					
						
							| 
									
										
										
										
											2008-10-18 20:26:43 -07:00
										 |  |  | 	} | 
					
						
							|  |  |  | } | 
					
						
							| 
									
										
											  
											
												SHM_UNLOCK: fix long unpreemptible section
scan_mapping_unevictable_pages() is used to make SysV SHM_LOCKed pages
evictable again once the shared memory is unlocked.  It does this with
pagevec_lookup()s across the whole object (which might occupy most of
memory), and takes 300ms to unlock 7GB here.  A cond_resched() every
PAGEVEC_SIZE pages would be good.
However, KOSAKI-san points out that this is called under shmem.c's
info->lock, and it's also under shm.c's shm_lock(), both spinlocks.
There is no strong reason for that: we need to take these pages off the
unevictable list soonish, but those locks are not required for it.
So move the call to scan_mapping_unevictable_pages() from shmem.c's
unlock handling up to shm.c's unlock handling.  Remove the recently
added barrier, not needed now we have spin_unlock() before the scan.
Use get_file(), with subsequent fput(), to make sure we have a reference
to mapping throughout scan_mapping_unevictable_pages(): that's something
that was previously guaranteed by the shm_lock().
Remove shmctl's lru_add_drain_all(): we don't fault in pages at SHM_LOCK
time, and we lazily discover them to be Unevictable later, so it serves
no purpose for SHM_LOCK; and serves no purpose for SHM_UNLOCK, since
pages still on pagevec are not marked Unevictable.
The original code avoided redundant rescans by checking VM_LOCKED flag
at its level: now avoid them by checking shp's SHM_LOCKED.
The original code called scan_mapping_unevictable_pages() on a locked
area at shm_destroy() time: perhaps we once had accounting cross-checks
which required that, but not now, so skip the overhead and just let
inode eviction deal with them.
Put check_move_unevictable_page() and scan_mapping_unevictable_pages()
under CONFIG_SHMEM (with stub for the TINY case when ramfs is used),
more as comment than to save space; comment them used for SHM_UNLOCK.
Signed-off-by: Hugh Dickins <hughd@google.com>
Reviewed-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Cc: Minchan Kim <minchan.kim@gmail.com>
Cc: Rik van Riel <riel@redhat.com>
Cc: Shaohua Li <shaohua.li@intel.com>
Cc: Eric Dumazet <eric.dumazet@gmail.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Michel Lespinasse <walken@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
											
										 
											2012-01-20 14:34:19 -08:00
										 |  |  | #endif /* CONFIG_SHMEM */
 | 
					
						
							| 
									
										
										
										
											2008-10-18 20:26:53 -07:00
										 |  |  | 
 | 
					
						
							| 
									
										
										
										
											2011-10-31 17:09:13 -07:00
										 |  |  | static void warn_scan_unevictable_pages(void) | 
					
						
							| 
									
										
										
										
											2008-10-18 20:26:53 -07:00
										 |  |  | { | 
					
						
							| 
									
										
										
										
											2011-10-31 17:09:13 -07:00
										 |  |  | 	printk_once(KERN_WARNING | 
					
						
							| 
									
										
										
										
											2012-01-10 15:07:40 -08:00
										 |  |  | 		    "%s: The scan_unevictable_pages sysctl/node-interface has been " | 
					
						
							| 
									
										
										
										
											2011-10-31 17:09:13 -07:00
										 |  |  | 		    "disabled for lack of a legitimate use case.  If you have " | 
					
						
							| 
									
										
										
										
											2012-01-10 15:07:40 -08:00
										 |  |  | 		    "one, please send an email to linux-mm@kvack.org.\n", | 
					
						
							|  |  |  | 		    current->comm); | 
					
						
							| 
									
										
										
										
											2008-10-18 20:26:53 -07:00
										 |  |  | } | 
					
						
							|  |  |  | 
 | 
					
						
							|  |  |  | /*
 | 
					
						
							|  |  |  |  * scan_unevictable_pages [vm] sysctl handler.  On demand re-scan of | 
					
						
							|  |  |  |  * all nodes' unevictable lists for evictable pages | 
					
						
							|  |  |  |  */ | 
					
						
							|  |  |  | unsigned long scan_unevictable_pages; | 
					
						
							|  |  |  | 
 | 
					
						
							|  |  |  | int scan_unevictable_handler(struct ctl_table *table, int write, | 
					
						
							| 
									
										
										
										
											2009-09-23 15:57:19 -07:00
										 |  |  | 			   void __user *buffer, | 
					
						
							| 
									
										
										
										
											2008-10-18 20:26:53 -07:00
										 |  |  | 			   size_t *length, loff_t *ppos) | 
					
						
							|  |  |  | { | 
					
						
							| 
									
										
										
										
											2011-10-31 17:09:13 -07:00
										 |  |  | 	warn_scan_unevictable_pages(); | 
					
						
							| 
									
										
										
										
											2009-09-23 15:57:19 -07:00
										 |  |  | 	proc_doulongvec_minmax(table, write, buffer, length, ppos); | 
					
						
							| 
									
										
										
										
											2008-10-18 20:26:53 -07:00
										 |  |  | 	scan_unevictable_pages = 0; | 
					
						
							|  |  |  | 	return 0; | 
					
						
							|  |  |  | } | 
					
						
							|  |  |  | 
 | 
					
						
							| 
									
										
										
										
											2010-10-26 14:21:28 -07:00
										 |  |  | #ifdef CONFIG_NUMA
 | 
					
						
							| 
									
										
										
										
											2008-10-18 20:26:53 -07:00
										 |  |  | /*
 | 
					
						
							|  |  |  |  * per node 'scan_unevictable_pages' attribute.  On demand re-scan of | 
					
						
							|  |  |  |  * a specified node's per zone unevictable lists for evictable pages. | 
					
						
							|  |  |  |  */ | 
					
						
							|  |  |  | 
 | 
					
						
							| 
									
										
										
										
											2011-12-21 14:48:43 -08:00
										 |  |  | static ssize_t read_scan_unevictable_node(struct device *dev, | 
					
						
							|  |  |  | 					  struct device_attribute *attr, | 
					
						
							| 
									
										
										
										
											2008-10-18 20:26:53 -07:00
										 |  |  | 					  char *buf) | 
					
						
							|  |  |  | { | 
					
						
							| 
									
										
										
										
											2011-10-31 17:09:13 -07:00
										 |  |  | 	warn_scan_unevictable_pages(); | 
					
						
							| 
									
										
										
										
											2008-10-18 20:26:53 -07:00
										 |  |  | 	return sprintf(buf, "0\n");	/* always zero; should fit... */ | 
					
						
							|  |  |  | } | 
					
						
							|  |  |  | 
 | 
					
						
							| 
									
										
										
										
											2011-12-21 14:48:43 -08:00
										 |  |  | static ssize_t write_scan_unevictable_node(struct device *dev, | 
					
						
							|  |  |  | 					   struct device_attribute *attr, | 
					
						
							| 
									
										
										
										
											2008-10-18 20:26:53 -07:00
										 |  |  | 					const char *buf, size_t count) | 
					
						
							|  |  |  | { | 
					
						
							| 
									
										
										
										
											2011-10-31 17:09:13 -07:00
										 |  |  | 	warn_scan_unevictable_pages(); | 
					
						
							| 
									
										
										
										
											2008-10-18 20:26:53 -07:00
										 |  |  | 	return 1; | 
					
						
							|  |  |  | } | 
					
						
							|  |  |  | 
 | 
					
						
							|  |  |  | 
 | 
					
						
							| 
									
										
										
										
											2011-12-21 14:48:43 -08:00
										 |  |  | static DEVICE_ATTR(scan_unevictable_pages, S_IRUGO | S_IWUSR, | 
					
						
							| 
									
										
										
										
											2008-10-18 20:26:53 -07:00
										 |  |  | 			read_scan_unevictable_node, | 
					
						
							|  |  |  | 			write_scan_unevictable_node); | 
					
						
							|  |  |  | 
 | 
					
						
							|  |  |  | int scan_unevictable_register_node(struct node *node) | 
					
						
							|  |  |  | { | 
					
						
							| 
									
										
										
										
											2011-12-21 14:48:43 -08:00
										 |  |  | 	return device_create_file(&node->dev, &dev_attr_scan_unevictable_pages); | 
					
						
							| 
									
										
										
										
											2008-10-18 20:26:53 -07:00
										 |  |  | } | 
					
						
							|  |  |  | 
 | 
					
						
							|  |  |  | void scan_unevictable_unregister_node(struct node *node) | 
					
						
							|  |  |  | { | 
					
						
							| 
									
										
										
										
											2011-12-21 14:48:43 -08:00
										 |  |  | 	device_remove_file(&node->dev, &dev_attr_scan_unevictable_pages); | 
					
						
							| 
									
										
										
										
											2008-10-18 20:26:53 -07:00
										 |  |  | } | 
					
						
							| 
									
										
										
										
											2010-10-26 14:21:28 -07:00
										 |  |  | #endif
 |