Commit graph

217556 commits

Author SHA1 Message Date
Thomas Gleixner
a731cd116c xfs: semaphore cleanup
Get rid of init_MUTEX[_LOCKED]() and use sema_init() instead.

(Ported to current XFS code by <aelder@sgi.com>.)

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Alex Elder <aelder@sgi.com>
2010-10-18 15:09:09 -05:00
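The substitution itself is mechanical. A minimal kernel-style sketch, using a hypothetical demo_sem rather than the semaphores the patch actually touches:

    #include <linux/semaphore.h>

    static struct semaphore demo_sem;       /* hypothetical example semaphore */

    static void demo_init(void)
    {
            /* was: init_MUTEX(&demo_sem):        a semaphore with a count of 1 */
            sema_init(&demo_sem, 1);

            /* was: init_MUTEX_LOCKED(&demo_sem): a semaphore with a count of 0 */
            sema_init(&demo_sem, 0);
    }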
Sascha Hauer
63f1474c69 mxc_nand: do not depend on disabling the irq in the interrupt handler
This patch reverts the driver to enabling/disabling the NFC interrupt
mask rather than enabling/disabling the system interrupt.  This cleans
up the driver so that it doesn't rely on interrupts being disabled
within the interrupt handler.

For i.MX21 we keep the current behaviour, that is, calling
enable_irq/disable_irq_nosync to enable/disable interrupts.  This patch

is based on earlier work by John Ogness.

Signed-off-by: Sascha Hauer <s.hauer@pengutronix.de>
Acked-by: John Ogness <john.ogness@linutronix.de>
Tested-by: John Ogness <john.ogness@linutronix.de>
Signed-off-by: David Woodhouse <dwmw2@infradead.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2010-10-18 13:09:05 -07:00
Arkadiusz Miśkiewicz
6743099ce5 xfs: Extend project quotas to support 32bit project ids
This patch adds support for 32bit project quota identifiers.

The on-disk format is backward compatible with 16bit projid numbers. The
projid is now kept on disk in two 16bit values - di_projid_lo (which holds
the same position as the old 16bit projid value) and the new di_projid_hi
(which takes existing padding) - and is converted from/to a 32bit value on
the fly.

xfs_admin (for an existing fs) or mkfs.xfs (for a new fs) needs to be used
to enable PROJID32BIT support.

Signed-off-by: Arkadiusz Miśkiewicz <arekm@maven.pl>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Alex Elder <aelder@sgi.com>
2010-10-18 15:08:08 -05:00
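The lo/hi split is easy to illustrate in plain C. The struct and helper names below are hypothetical; only the 16/16-bit packing mirrors what the commit describes:

    #include <stdint.h>
    #include <stdio.h>

    /* Hypothetical stand-ins for the two on-disk fields described above. */
    struct demo_dinode {
            uint16_t di_projid_lo;  /* same slot as the old 16bit projid */
            uint16_t di_projid_hi;  /* carved out of existing padding    */
    };

    static uint32_t demo_get_projid(const struct demo_dinode *dip)
    {
            return (uint32_t)dip->di_projid_hi << 16 | dip->di_projid_lo;
    }

    static void demo_set_projid(struct demo_dinode *dip, uint32_t projid)
    {
            dip->di_projid_lo = projid & 0xffff;
            dip->di_projid_hi = projid >> 16;
    }

    int main(void)
    {
            struct demo_dinode di;

            demo_set_projid(&di, 70000);            /* needs more than 16 bits */
            printf("%u\n", demo_get_projid(&di));   /* prints 70000            */
            return 0;
    }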
Christoph Hellwig
1a1a3e97ba xfs: remove xfs_buf wrappers
Stop having two different names for many buffer functions and use
the more descriptive xfs_buf_* names directly.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Alex Elder <aelder@sgi.com>
2010-10-18 15:08:07 -05:00
Christoph Hellwig
6c77b0ea1b xfs: remove xfs_cred.h
We haven't actually been passing credentials around inside XFS for a
while now, so remove xfs_cred.h with its cred_t typedef and all
instances of it.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Alex Elder <aelder@sgi.com>
2010-10-18 15:08:06 -05:00
Christoph Hellwig
78a4b0961f xfs: remove xfs_globals.h
This header only provides one extern that isn't actually declared
anywhere and is shadowed by a macro.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Alex Elder <aelder@sgi.com>
2010-10-18 15:08:05 -05:00
Christoph Hellwig
668332e5fe xfs: remove xfs_version.h
It used to have a purpose when it contained an automatically generated
CVS version, but these days it's entirely superfluous.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Alex Elder <aelder@sgi.com>
2010-10-18 15:08:04 -05:00
Christoph Hellwig
1ae4fe6dba xfs: remove xfs_refcache.h
This header has been completely unused for a couple of years.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Alex Elder <aelder@sgi.com>
2010-10-18 15:08:03 -05:00
Christoph Hellwig
4957a449a1 xfs: fix the xfs_trans_committed
Use the correct prototype for xfs_trans_committed instead of casting it.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Alex Elder <aelder@sgi.com>
2010-10-18 15:08:02 -05:00
Christoph Hellwig
dfe188d428 xfs: remove unused t_callback field in struct xfs_trans
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Alex Elder <aelder@sgi.com>
2010-10-18 15:08:01 -05:00
Christoph Hellwig
d276734d93 xfs: fix bogus m_maxagi check in xfs_iget
These days inode64 should only control which AGs we allocate new
inodes from, while we still try to support reading all existing
inodes.  To make this actually work, the check on top of xfs_iget
needs to be relaxed to allow inodes in all allocation groups instead
of just those that we allow allocating inodes from.  Note that we
can't simply remove the check - it prevents us from accessing
invalid data when fed invalid inode numbers from NFS or bulkstat.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Alex Elder <aelder@sgi.com>
2010-10-18 15:08:01 -05:00
Christoph Hellwig
1b0407125f xfs: do not use xfs_mod_incore_sb_batch for per-cpu counters
Update the per-cpu counters manually in xfs_trans_unreserve_and_mod_sb
and remove support for per-cpu counters from xfs_mod_incore_sb_batch
to simplify it.  An added benefit is that we don't have to take
m_sb_lock for transactions that only modify per-cpu counters.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Alex Elder <aelder@sgi.com>
2010-10-18 15:08:00 -05:00
Christoph Hellwig
96540c7858 xfs: do not use xfs_mod_incore_sb for per-cpu counters
Export xfs_icsb_modify_counters and always use it for modifying
the per-cpu counters.  Remove support for per-cpu counters from
xfs_mod_incore_sb to simplify it.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Alex Elder <aelder@sgi.com>
2010-10-18 15:07:59 -05:00
Christoph Hellwig
61ba35dea0 xfs: remove XFS_MOUNT_NO_PERCPU_SB
Fail the mount if we can't allocate memory for the per-CPU counters.
This is consistent with how we handle everything else in the mount
path and makes the superblock counter modification a lot simpler.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Alex Elder <aelder@sgi.com>
2010-10-18 15:07:58 -05:00
Dave Chinner
50f59e8eed xfs: pack xfs_buf structure more tightly
pahole reports the struct xfs_buf has quite a few holes in it, so
packing the structure better will reduce the size of it by 16 bytes.
Also, move all the fields used in cache lookups into the first
cacheline.

Before on x86_64:

        /* size: 320, cachelines: 5 */
        /* sum members: 298, holes: 6, sum holes: 22 */

After on x86_64:

        /* size: 304, cachelines: 5 */
        /* padding: 6 */
        /* last cacheline: 48 bytes */

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Alex Elder <aelder@sgi.com>
2010-10-18 15:07:57 -05:00
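For readers who have not seen pahole output before, here is a toy, compilable illustration of the kind of hole elimination being reported; the structs have nothing to do with the real xfs_buf layout:

    #include <stdint.h>
    #include <stdio.h>

    struct loose {                  /* pahole would report a 4-byte hole...  */
            uint32_t flags;         /* offset 0, size 4                      */
                                    /* 4-byte hole                           */
            uint64_t blockno;       /* offset 8, size 8                      */
            uint16_t count;         /* offset 16, size 2                     */
    };                              /* ...plus 6 bytes of trailing padding   */

    struct packed_by_hand {         /* same members, reordered: no holes     */
            uint64_t blockno;
            uint32_t flags;
            uint16_t count;
    };

    int main(void)
    {
            printf("loose=%zu packed=%zu\n",
                   sizeof(struct loose), sizeof(struct packed_by_hand));
            /* typically prints "loose=24 packed=16" on x86_64 */
            return 0;
    }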
Dave Chinner
74f75a0cb7 xfs: convert buffer cache hash to rbtree
The buffer cache hash is showing typical hash scalability problems.
In large scale testing the number of cached items grows far larger
than the hash can efficiently handle. Hence we need to move to a
self-scaling cache indexing mechanism.

I have selected rbtrees for indexing because they can have O(log n)
search scalability, and insert and remove cost is not excessive,
even on large trees. Hence we should be able to cache large numbers
of buffers without incurring the excessive cache miss search
penalties that the hash is imposing on us.

To ensure we still have parallel access to the cache, we need
multiple trees. Rather than hashing the buffers by disk address to
select a tree, it seems more sensible to separate trees by typical
access patterns. Most operations use buffers from within a single AG
at a time, so rather than searching lots of different lists,
separate the buffer indexes out into per-AG rbtrees. This means that
searches during metadata operation have a much higher chance of
hitting cache resident nodes, and that updates of the tree are less
likely to disturb trees being accessed on other CPUs doing
independent operations.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Alex Elder <aelder@sgi.com>
2010-10-18 15:07:56 -05:00
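A minimal sketch of a per-AG rbtree index keyed on disk block number, using the kernel's rbtree API; demo_buf and its fields are hypothetical, not the actual xfs_buf or xfs_perag definitions:

    #include <linux/rbtree.h>
    #include <linux/types.h>

    struct demo_buf {
            struct rb_node  b_rbnode;
            unsigned long   b_blkno;        /* lookup key within the AG */
    };

    /* Insert into one AG's tree; returns false if the block is already cached. */
    static bool demo_buf_insert(struct rb_root *root, struct demo_buf *new)
    {
            struct rb_node **link = &root->rb_node;
            struct rb_node *parent = NULL;

            while (*link) {
                    struct demo_buf *bp = rb_entry(*link, struct demo_buf, b_rbnode);

                    parent = *link;
                    if (new->b_blkno < bp->b_blkno)
                            link = &(*link)->rb_left;
                    else if (new->b_blkno > bp->b_blkno)
                            link = &(*link)->rb_right;
                    else
                            return false;   /* already present in the cache */
            }
            rb_link_node(&new->b_rbnode, parent, link);
            rb_insert_color(&new->b_rbnode, root);
            return true;
    }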
Dave Chinner
69b491c214 xfs: serialise inode reclaim within an AG
Memory reclaim via shrinkers has a terrible habit of having N+M
concurrent shrinker executions (N = num CPUs, M = num kswapds) all
trying to shrink the same cache. When the cache they are all working
on is protected by a single spinlock, massive contention and
slowdowns occur.

Wrap the per-ag inode caches with a reclaim mutex to serialise
reclaim access to the AG. This will block concurrent reclaim in each
AG but still allow reclaim to scan multiple AGs concurrently. Allow
shrinkers to move on to the next AG if they can't get the lock, and if
we can't get any AG, then start blocking on locks.

To prevent reclaimers from continually scanning the same inodes in
each AG, add a cursor that tracks where the last reclaim got up to
and start from that point on the next reclaim. This should avoid
only ever scanning a small number of inodes at the start of each AG
and not making progress. If we have a non-shrinker based reclaim
pass, ignore the cursor and reset it to zero once we are done.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Alex Elder <aelder@sgi.com>
2010-10-18 15:07:55 -05:00
Dave Chinner
e3a20c0b02 xfs: batch inode reclaim lookup
Batch and optimise the per-ag inode lookup for reclaim to minimise
scanning overhead. This involves gang lookups on the radix trees to
get multiple inodes during each tree walk, and tighter validation of
what inodes can be reclaimed without blocking before we take any
locks.

This is based on ideas suggested in a proof-of-concept patch
posted by Nick Piggin.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Alex Elder <aelder@sgi.com>
2010-10-18 15:07:54 -05:00
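The gang-lookup pattern referred to above looks roughly like the sketch below; the batch size is illustrative and the per-inode validation, locking and reclaim work is omitted:

    #include <linux/radix-tree.h>
    #include "xfs_inode.h"          /* kernel-internal XFS headers assumed on the include path */

    #define DEMO_LOOKUP_BATCH 32    /* illustrative batch size */

    static void demo_walk_ag_inodes(struct xfs_mount *mp, struct radix_tree_root *root)
    {
            struct xfs_inode *batch[DEMO_LOOKUP_BATCH];
            unsigned long first_index = 0;
            unsigned int nr, i;

            do {
                    /* grab up to a whole batch of inodes per tree walk */
                    nr = radix_tree_gang_lookup(root, (void **)batch,
                                                first_index, DEMO_LOOKUP_BATCH);
                    for (i = 0; i < nr; i++) {
                            /* validate/grab batch[i] here, then process it */
                            first_index = XFS_INO_TO_AGINO(mp, batch[i]->i_ino) + 1;
                    }
            } while (nr == DEMO_LOOKUP_BATCH);
    }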
Dave Chinner
78ae525676 xfs: implement batched inode lookups for AG walking
With the reclaim code separated from the generic walking code, it is
simple to implement batched lookups for the generic walk code.
Separate out the inode validation from the execute operations and
modify the tree lookups to get a batch of inodes at a time.

Reclaim operations will be optimised separately.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Alex Elder <aelder@sgi.com>
2010-10-18 15:07:53 -05:00
Dave Chinner
e13de955ca xfs: split out inode walk inode grabbing
When doing read side inode cache walks, the code to validate and
grab an inode is common to all callers. Split it out of the execute
callbacks in preparation for batching lookups. Similarly, split out
the inode reference dropping from the execute callbacks into the
main lookup loop to be symmetric with the grab.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Alex Elder <aelder@sgi.com>
2010-10-18 15:07:52 -05:00
Dave Chinner
65d0f20533 xfs: split inode AG walking into separate code for reclaim
The reclaim walk requires different locking and has a slightly
different walk algorithm, so separate it out so that it can be
optimised separately.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Alex Elder <aelder@sgi.com>
2010-10-18 15:07:52 -05:00
Dave Chinner
69d6cc76cf xfs: remove buftarg hash for external devices
We never use hashed buffers on RT and external log devices now, so
remove the buftarg hash tables that are set up for them.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Alex Elder <aelder@sgi.com>
2010-10-18 15:07:51 -05:00
Dave Chinner
1922c949c5 xfs: use unhashed buffers for size checks
When we are checking that we can access the last block of each device, we
do not need to use cached buffers as they will be tossed away
immediately. Use uncached buffers for size checks so that all IO
prior to full in-memory structure initialisation does not use the
buffer cache.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Alex Elder <aelder@sgi.com>
2010-10-18 15:07:50 -05:00
Dave Chinner
26af655233 xfs: kill XBF_FS_MANAGED buffers
Filesystem level managed buffers are buffers that have their
lifecycle controlled by the filesystem layer, not the buffer cache.
We currently cache these buffers, which makes cleanup and cache
walking somewhat troublesome. Convert the fs managed buffers to
uncached buffers obtained via xfs_buf_get_uncached(), and remove
the XBF_FS_MANAGED special cases from the buffer cache.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Alex Elder <aelder@sgi.com>
2010-10-18 15:07:49 -05:00
Dave Chinner
ebad861b57 xfs: store xfs_mount in the buftarg instead of in the xfs_buf
Each buffer contains both a buftarg pointer and a mount pointer. If
we add a mount pointer into the buftarg, we can avoid needing the
b_mount field in every buffer and grab it from the buftarg when
needed instead. This shrinks the xfs_buf by 8 bytes.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Alex Elder <aelder@sgi.com>
2010-10-18 15:07:48 -05:00
Dave Chinner
5adc94c247 xfs: introduce uncached buffer read primitive
To avoid the need to use cached buffers for single-shot or buffers
cached at the filesystem level, introduce a new buffer read
primitive that bypasses the cache and reads directly from disk.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Alex Elder <aelder@sgi.com>
2010-10-18 15:07:47 -05:00
Dave Chinner
686865f76e xfs: rename xfs_buf_get_nodaddr to be more appropriate
xfs_buf_get_nodaddr() is really used to allocate a buffer that is
uncached. While it is not directly assigned a disk address, the fact
that it is not cached is the more important distinction. With the
upcoming uncached buffer read primitive, we should be consistent
with this distinction.

While there, make page allocation in xfs_buf_get_nodaddr() safe
against memory reclaim re-entrancy into the filesystem by allowing
a flags parameter to be passed.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Alex Elder <aelder@sgi.com>
2010-10-18 15:07:46 -05:00
Dave Chinner
dcd79a1423 xfs: don't use vfs writeback for pure metadata modifications
Under heavy multi-way parallel create workloads, the VFS struggles
to write back all the inodes that have been changed in age order.
The bdi flusher thread becomes CPU bound, spending 85% of its time
in the VFS code, mostly traversing the superblock dirty inode list
to separate dirty inodes old enough to flush.

We already keep an index of all metadata changes in age order - in
the AIL - and continued log pressure will do age ordered writeback
without any extra overhead at all. If there is no pressure on the
log, the xfssyncd will periodically write back metadata in ascending
disk address offset order, so it will be very efficient.

Hence we can stop marking VFS inodes dirty during transaction commit
or when changing timestamps during transactions. This will limit the
inodes in the superblock dirty list to those containing data or
unlogged metadata changes.

However, the timestamp changes are slightly more complex than this -
there are a couple of places that do unlogged updates of the
timestamps, and the VFS needs to be informed of these. Hence add a
new function xfs_trans_ichgtime() for transactional changes,
and leave xfs_ichgtime() for the non-transactional changes.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Alex Elder <aelder@sgi.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
2010-10-18 15:07:45 -05:00
Dave Chinner
e176579e70 xfs: lockless per-ag lookups
When we start taking a reference to the per-ag for every cached
buffer in the system, kernel lockstat profiling on an 8-way create
workload shows the mp->m_perag_lock has higher acquisition rates
than the inode lock and has significantly more contention. That is,
it becomes the highest contended lock in the system.

The perag lookup is trivial to convert to lock-less RCU lookups
because perag structures never go away. Hence the only thing we need
to protect against is tree structure changes during a grow. This can
be done simply by replacing the locking in xfs_perag_get() with RCU
read locking. This removes the mp->m_perag_lock completely from this
path.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Alex Elder <aelder@sgi.com>
2010-10-18 15:07:44 -05:00
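The read side of this conversion follows the standard RCU pattern: replace the lock round trip with rcu_read_lock() around the radix tree lookup and take a reference before returning. A simplified sketch with a cut-down demo_perag in place of the real structure:

    #include <linux/radix-tree.h>
    #include <linux/rcupdate.h>
    #include <linux/atomic.h>

    /* Cut-down stand-in for the per-AG structure; only the refcount matters here. */
    struct demo_perag {
            atomic_t pag_ref;
    };

    static struct demo_perag *demo_perag_get(struct radix_tree_root *tree,
                                             unsigned long agno)
    {
            struct demo_perag *pag;

            rcu_read_lock();                        /* replaces taking m_perag_lock */
            pag = radix_tree_lookup(tree, agno);
            if (pag)
                    atomic_inc(&pag->pag_ref);      /* safe: perag structures never go away */
            rcu_read_unlock();
            return pag;
    }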
Dave Chinner
bd32d25a7c xfs: remove debug assert for per-ag reference counting
When we start taking references per cached buffer to the perag
it is cached on, it will blow the current debug maximum reference
count assert out of the water. The assert has never caught a bug,
and we have tracing to track changes if there ever is a problem,
so just remove it.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Alex Elder <aelder@sgi.com>
2010-10-18 15:07:43 -05:00
Dave Chinner
d1583a3833 xfs: reduce the number of CIL lock round trips during commit
When committing a transaction, we do a CIL state lock round trip
on every single log vector we insert into the CIL. This is resulting
in the lock being as hot as the inode and dcache locks on 8-way
create workloads. Rework the insertion loops to bring the number
of lock round trips to one per transaction for log vectors, and one
more for the busy extents.

Also change the allocation of the log vector buffer not to zero it
as we copy over the entire allocated buffer anyway.

This patch also includes a structural cleanup to the CIL item
insertion provided by Christoph Hellwig.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Alex Elder <aelder@sgi.com>
2010-10-18 15:07:42 -05:00
Poyo VL
9c169915ad xfs: eliminate some newly-reported gcc warnings
Ionut Gabriel Popescu <poyo_vl@yahoo.com> submitted a simple change
to eliminate some "may be used uninitialized" warnings when building
XFS.  The reported condition seems to be something that GCC did not
previously recognize or report.  The warnings were produced by:

    gcc version 4.5.0 20100604
    [gcc-4_5-branch revision 160292] (SUSE Linux)

Signed-off-by: Ionut Gabriel Popescu <poyo_vl@yahoo.com>
Signed-off-by: Alex Elder <aelder@sgi.com>
2010-10-18 15:07:39 -05:00
Christoph Hellwig
c0e59e1ac0 xfs: remove the ->kill_root btree operation
The implementations of ->kill_root only differ by either simply
zeroing out the now unused buffer in the btree cursor in the inode
allocation btree or using xfs_btree_setbuf in the allocation btree.

Initially both of them used xfs_btree_setbuf, but the use in the
ialloc btree was removed early on because it interacted badly with
xfs_trans_binval.

In addition to zeroing out the buffer in the cursor xfs_btree_setbuf
updates the bc_ra array in the btree cursor, and calls
xfs_trans_brelse on the buffer previously occupying the slot.

The bc_ra update should be done for the alloc btree too,
although the lack of it does not cause serious problems.  The
xfs_trans_brelse call on the other hand is effectively a no-op in
the end - it keeps decrementing the bli_recur refcount until it hits
zero, and then just skips out because the buffer will always be
dirty at this point.  So removing it for the allocation btree is
just fine.

So unify the code and move it to xfs_btree.c.  While we're at it
also replace the call to xfs_btree_setbuf with a NULL bp argument in
xfs_btree_del_cursor with a direct call to xfs_trans_brelse given
that the cursor is being freed just after this and the state
updates are superfluous.  After this xfs_btree_setbuf is only used
with a non-NULL bp argument and can thus be simplified.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Alex Elder <aelder@sgi.com>
2010-10-18 15:07:38 -05:00
Christoph Hellwig
acecf1b5d8 xfs: stop using xfs_qm_dqtobp in xfs_qm_dqflush
In xfs_qm_dqflush we know that q_blkno must be initialized already from a
previous xfs_qm_dqread.  So instead of calling xfs_qm_dqtobp we can
simply read the quota buffer directly.  This also saves us from a duplicate
xfs_qm_dqcheck call and allows xfs_qm_dqtobp to be simplified now
that it is always called for a newly initialized inode.  In addition to
that properly unwind all locks in xfs_qm_dqflush when xfs_qm_dqcheck
fails.

This mirrors a similar cleanup in the inode lookup done earlier.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Alex Elder <aelder@sgi.com>
2010-10-18 15:07:37 -05:00
Christoph Hellwig
52fda11424 xfs: simplify xfs_qm_dqusage_adjust
There is no need to have the user and group/project quotas locked at the
same time.  Get rid of xfs_qm_dqget_noattach and just do an xfs_qm_dqget
inside xfs_qm_quotacheck_dqadjust for the quota we are operating on
right now.  The new version of xfs_qm_quotacheck_dqadjust holds the
inode lock over its operations, which is not a problem as it simply
increments counters and there is no concern about log contention
during mount time.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Alex Elder <aelder@sgi.com>
2010-10-18 15:07:36 -05:00
Dave Chinner
4472235205 xfs: Introduce XFS_IOC_ZERO_RANGE
XFS_IOC_ZERO_RANGE is the equivalent of an atomic XFS_IOC_UNRESVSP/
XFS_IOC_RESVSP call pair. It enables ranges of written data to be
turned into zeroes without requiring IO or having to free and
reallocate the extents in the range given as would occur if we had
to punch and then preallocate them separately.  This enables
applications to zero parts of files very quickly without changing
the layout of the files in any way.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
2010-10-18 15:07:25 -05:00
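From userspace the new ioctl is driven the same way as the existing XFS_IOC_RESVSP/XFS_IOC_UNRESVSP calls, with an xfs_flock64 describing the range. A sketch that assumes the xfsprogs header location and omits error handling:

    #include <sys/ioctl.h>
    #include <unistd.h>
    #include <xfs/xfs_fs.h>         /* xfs_flock64_t, XFS_IOC_ZERO_RANGE (assumed location) */

    /* Zero 'len' bytes of already-written data starting at 'off', without I/O. */
    static int demo_zero_range(int fd, off_t off, off_t len)
    {
            xfs_flock64_t fl = {
                    .l_whence = SEEK_SET,
                    .l_start  = off,
                    .l_len    = len,
            };

            return ioctl(fd, XFS_IOC_ZERO_RANGE, &fl);
    }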
Dave Chinner
3ae4c9deb3 xfs: use range primitives for xfs page cache operations
While XFS passes ranges to operate on from the core code, the
functions being called ignore either the entire range or the end
of the range. This is historical: when the functions were written,
Linux didn't have the necessary range operations. Update the
functions to use the correct operations.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
2010-10-18 15:07:24 -05:00
Linus Torvalds
f68c834b04 Merge branch 'for-linus/i2c/2636-rc8' of git://git.fluff.org/bjdooks/linux
* 'for-linus/i2c/2636-rc8' of git://git.fluff.org/bjdooks/linux:
  i2c-imx: do not allow interruptions when waiting for I2C to complete
  i2c-davinci: Fix TX setup for more SoCs
2010-10-18 13:05:10 -07:00
Linus Torvalds
822a2e4524 Merge branch 'fixes'
* fixes:
  v4l1: fix 32-bit compat microcode loading translation
  De-pessimize rds_page_copy_user
2010-10-18 13:04:33 -07:00
Ingo Molnar
b7dadc3879 sched: Export account_system_vtime()
KVM uses it for example:

 ERROR: "account_system_vtime" [arch/x86/kvm/kvm.ko] undefined!

Cc: Venkatesh Pallipadi <venki@google.com>
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
LKML-Reference: <1286237003-12406-3-git-send-email-venki@google.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2010-10-18 20:52:30 +02:00
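The fix is the usual one-line export next to the function's definition in kernel/sched.c, so that modular code such as kvm.ko can resolve the symbol at load time. A sketch, assuming the GPL-only export variant, with the function body elided:

    #include <linux/module.h>
    #include <linux/sched.h>

    void account_system_vtime(struct task_struct *curr)
    {
            /* ... existing irq time accounting elided ... */
    }
    EXPORT_SYMBOL_GPL(account_system_vtime);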
Venkatesh Pallipadi
d267f87fb8 sched: Call tick_check_idle before __irq_enter
When the CPU is idle, on the first interrupt irq_enter calls
tick_check_idle() to note the interruption from idle. But there is a
problem if this call is done after __irq_enter, as all routines in
__irq_enter may find stale time due to the yet-to-be-done tick_check_idle.

Specifically, this affects trace calls in __irq_enter when they use the
global clock, and also the account_system_vtime change in this patch, as
it wants to use sched_clock_cpu() to do proper irq timing.

But tick_check_idle was intentionally moved after __irq_enter to
prevent the problem of unneeded ksoftirqd wakeups, by commit ee5f80a:

    irq: call __irq_enter() before calling the tick_idle_check
    Impact: avoid spurious ksoftirqd wakeups

Moving tick_check_idle() before __irq_enter and wrapping it with
local_bh_disable/enable solves both problems.

Fixed-by: Yong Zhang <yong.zhang0@gmail.com>
Signed-off-by: Venkatesh Pallipadi <venki@google.com>
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
LKML-Reference: <1286237003-12406-9-git-send-email-venki@google.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2010-10-18 20:52:29 +02:00
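Putting the description together, the resulting entry path looks roughly like the sketch below; this is a simplified rendering of the ordering the commit describes, not the verbatim patch:

    /* kernel/softirq.c (sketch) */
    void irq_enter(void)
    {
            int cpu = smp_processor_id();

            rcu_irq_enter();
            if (idle_cpu(cpu) && !in_interrupt()) {
                    /* bh is disabled around tick_check_idle() so it cannot cause
                     * the spurious ksoftirqd wakeups that commit ee5f80a worked
                     * around by ordering it after __irq_enter. */
                    local_bh_disable();
                    tick_check_idle(cpu);
                    _local_bh_enable();
            }
            __irq_enter();
    }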
Venkatesh Pallipadi
aa48380851 sched: Remove irq time from available CPU power
The idea was suggested by Peter Zijlstra here:

  http://marc.info/?l=linux-kernel&m=127476934517534&w=2

irq time is technically not available to the tasks running on the CPU.
This patch removes irq time from CPU power, piggybacking on
sched_rt_avg_update().

Tested this by keeping CPU X busy with a network intensive task having 75%
of a single CPU in irq processing (hard+soft) on a 4-way system, and starting
seven cycle soakers on the system. Without this change, there will be two tasks on
each CPU. With this change, there is a single task on irq busy CPU X and
remaining 7 tasks are spread around among other 3 CPUs.

Signed-off-by: Venkatesh Pallipadi <venki@google.com>
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
LKML-Reference: <1286237003-12406-8-git-send-email-venki@google.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2010-10-18 20:52:27 +02:00
Venkatesh Pallipadi
305e6835e0 sched: Do not account irq time to current task
Scheduler accounts both softirq and interrupt processing times to the
currently running task. This means, if the interrupt processing was
for some other task in the system, then the current task ends up being
penalized as it gets shorter runtime than otherwise.

Change sched task accounting to account only actual task time for the
currently running task. Now update_curr() modifies the delta_exec to
depend on rq->clock_task.

Note that this change only handles the CONFIG_IRQ_TIME_ACCOUNTING case. We can
extend this to CONFIG_VIRT_CPU_ACCOUNTING with minimal effort. But that's
for later.

This change will impact scheduling behavior in interrupt heavy conditions.

Tested on a 4-way system with eth0 handled by CPU 2 and a network heavy
task (nc) running on CPU 3 (and no RSS/RFS). With that I have CPU 2
spending 75%+ of its time in irq processing, and CPU 3 spending around 35%
of its time running the nc task.

Now, if I run another CPU intensive task on CPU 2, without this change
/proc/<pid>/schedstat shows 100% of time accounted to this task. With this
change, it rightly shows less than 25% accounted to this task as remaining
time is actually spent on irq processing.

Signed-off-by: Venkatesh Pallipadi <venki@google.com>
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
LKML-Reference: <1286237003-12406-7-git-send-email-venki@google.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2010-10-18 20:52:26 +02:00
Venkatesh Pallipadi
e82b8e4ea4 x86: Add IRQ_TIME_ACCOUNTING
This patch adds the IRQ_TIME_ACCOUNTING option on x86 and enables it at
runtime when the TSC is enabled.

This change just enables fine grained irq time accounting; it isn't used yet.
Following patches use it for different purposes.

Signed-off-by: Venkatesh Pallipadi <venki@google.com>
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
LKML-Reference: <1286237003-12406-6-git-send-email-venki@google.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2010-10-18 20:52:25 +02:00
Venkatesh Pallipadi
b52bfee445 sched: Add IRQ_TIME_ACCOUNTING, finer accounting of irq time
s390/powerpc/ia64 have support for CONFIG_VIRT_CPU_ACCOUNTING which does
the fine granularity accounting of user, system, hardirq, softirq times.
Adding that option on archs like x86 will be challenging however, given the
state of TSC reliability on various platforms and also the overhead it will
add in syscall entry/exit.

Instead, add a lighter variant that only does finer accounting of
hardirq and softirq times, providing precise irq times (instead of timer tick
based samples). This accounting is added with a new config option
CONFIG_IRQ_TIME_ACCOUNTING so that there won't be any overhead for users not
interested in paying the perf penalty.

This accounting is based on sched_clock, with the code being generic.
So, other archs may find it useful as well.

This patch just adds the core logic and does not enable this logic yet.

Signed-off-by: Venkatesh Pallipadi <venki@google.com>
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
LKML-Reference: <1286237003-12406-5-git-send-email-venki@google.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2010-10-18 20:52:24 +02:00
Venkatesh Pallipadi
6cdd5199da sched: Add a PF flag for ksoftirqd identification
To account softirq time cleanly in the scheduler, we need to identify whether
a softirq is invoked in ksoftirqd context or at hardirq tail context.
Add PF_KSOFTIRQD for that purpose.

As all PF flag bits are currently taken, create space by moving one of the
infrequently used bits (PF_THREAD_BOUND) down in task_struct to sit alongside
some other state fields.

Signed-off-by: Venkatesh Pallipadi <venki@google.com>
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
LKML-Reference: <1286237003-12406-4-git-send-email-venki@google.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2010-10-18 20:52:22 +02:00
Venkatesh Pallipadi
e1e10a265d sched: Consolidate account_system_vtime extern declaration
Just a minor cleanup patch that makes things easier for the following patches.
No functionality change in this patch.

Signed-off-by: Venkatesh Pallipadi <venki@google.com>
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
LKML-Reference: <1286237003-12406-3-git-send-email-venki@google.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2010-10-18 20:52:21 +02:00
Venkatesh Pallipadi
75e1056f5c sched: Fix softirq time accounting
Peter Zijlstra found a bug in the way softirq time is accounted in
VIRT_CPU_ACCOUNTING on this thread:

   http://lkml.indiana.edu/hypermail//linux/kernel/1009.2/01366.html

The problem is, softirq processing uses local_bh_disable internally. There
is no way, later in the flow, to differentiate whether a softirq is
being processed or bh has merely been disabled. So, a hardirq when bh
is disabled results in time being wrongly accounted as softirq.

Looking at the code a bit more, the problem exists in !VIRT_CPU_ACCOUNTING
as well, as account_system_time() in normal tick based accounting also uses
softirq_count, which will be set even when not in softirq but with bh disabled.

Peter also suggested the solution of using 2*SOFTIRQ_OFFSET as the irq count
for local_bh_{disable,enable} and just SOFTIRQ_OFFSET while processing
softirqs. The patch below does that and adds the API in_serving_softirq(),
which returns whether we are currently processing a softirq or not.

Also changes one of the usages of softirq_count in net/sched/cls_cgroup.c
to in_serving_softirq.

Looks like many usages of in_softirq really want in_serving_softirq. Those
changes can be made individually on a case by case basis.

Signed-off-by: Venkatesh Pallipadi <venki@google.com>
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
LKML-Reference: <1286237003-12406-2-git-send-email-venki@google.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2010-10-18 20:52:20 +02:00
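The counting trick can be modelled in a few lines of standalone C: bh-disable raises the softirq part of the preempt count by 2*SOFTIRQ_OFFSET, real softirq processing raises it by SOFTIRQ_OFFSET, so the low softirq bit identifies "serving softirq". This is a userspace model of the scheme, not the kernel's preempt-count code:

    #include <assert.h>

    #define SOFTIRQ_OFFSET          0x100
    #define SOFTIRQ_DISABLE_OFFSET  (2 * SOFTIRQ_OFFSET)

    static unsigned int preempt_count;

    #define softirq_count()      (preempt_count & 0xff00)
    #define in_softirq()         (softirq_count() != 0)              /* old check     */
    #define in_serving_softirq() (softirq_count() & SOFTIRQ_OFFSET)  /* new, stricter */

    int main(void)
    {
            preempt_count += SOFTIRQ_DISABLE_OFFSET;    /* local_bh_disable()       */
            assert(in_softirq() && !in_serving_softirq());

            preempt_count += SOFTIRQ_OFFSET;            /* actually running softirq */
            assert(in_serving_softirq());

            preempt_count -= SOFTIRQ_OFFSET;
            preempt_count -= SOFTIRQ_DISABLE_OFFSET;    /* local_bh_enable()        */
            assert(!in_softirq());
            return 0;
    }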
Nikhil Rao
75dd321d79 sched: Drop group_capacity to 1 only if local group has extra capacity
When SD_PREFER_SIBLING is set on a sched domain, drop group_capacity to 1
only if the local group has extra capacity. The extra check prevents the case
where you always pull from the heaviest group when it is already under-utilized
(possible when a large weight task outweighs the tasks on the system).

For example, consider a 16-cpu quad-core quad-socket machine with MC and NUMA
scheduling domains. Let's say we spawn 15 nice0 tasks and one nice-15 task,
and each task is running on one core. In this case, we observe the following
events when balancing at the NUMA domain:

- find_busiest_group() will always pick the sched group containing the niced
  task to be the busiest group.
- find_busiest_queue() will then always pick one of the cpus running the
  nice0 task (never picks the cpu with the nice -15 task since
  weighted_cpuload > imbalance).
- The load balancer fails to migrate the task since it is the running task
  and increments sd->nr_balance_failed.
- It repeats the above steps a few more times until sd->nr_balance_failed > 5,
  at which point it kicks off the active load balancer, wakes up the migration
  thread and kicks the nice 0 task off the cpu.

The load balancer doesn't stop until we kick out all nice 0 tasks from
the sched group, leaving you with 3 idle cpus and one cpu running the
nice -15 task.

When balancing at the NUMA domain, we drop sgs.group_capacity to 1 if the child
domain (in this case MC) has SD_PREFER_SIBLING set.  Subsequent load checks are
not relevant because the niced task has a very large weight.

In this patch, we add an extra condition to the "if(prefer_sibling)" check in
update_sd_lb_stats(). We drop the capacity of a group only if the local group
has extra capacity, ie. nr_running < group_capacity. This patch preserves the
original intent of the prefer_siblings check (to spread tasks across the system
in low utilization scenarios) and fixes the case above.

It helps in the following ways:
- In low utilization cases (where nr_tasks << nr_cpus), we still drop
  group_capacity down to 1 if we prefer siblings.
- On very busy systems (where nr_tasks >> nr_cpus), sgs.nr_running will most
  likely be > sgs.group_capacity.
- When balancing large weight tasks, if the local group does not have extra
  capacity, we do not pick the group with the niced task as the busiest group.
  This prevents failed balances, active migration and the under-utilization
  described above.

Signed-off-by: Nikhil Rao <ncrao@google.com>
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
LKML-Reference: <1287173550-30365-5-git-send-email-ncrao@google.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2010-10-18 20:52:19 +02:00
Nikhil Rao
fab476228b sched: Force balancing on newidle balance if local group has capacity
This patch forces a load balance on a newly idle cpu when the local group has
extra capacity and the busiest group does not have any. It improves system
utilization when balancing tasks with a large weight differential.

Under certain situations, such as a niced down task (i.e. nice = -15) in the
presence of nr_cpus NICE0 tasks, the niced task lands on a sched group and
kicks away other tasks because of its large weight. This leads to sub-optimal
utilization of the machine. Even though the sched group has capacity, it does
not pull tasks because sds.this_load >> sds.max_load, and f_b_g() returns NULL.

With this patch, if the local group has extra capacity, we shortcut the checks
in f_b_g() and try to pull a task over. A sched group has extra capacity if the
group capacity is greater than the number of running tasks in that group.

Thanks to Mike Galbraith for discussions leading to this patch and for the
insight to reuse SD_NEWIDLE_BALANCE.

Signed-off-by: Nikhil Rao <ncrao@google.com>
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
LKML-Reference: <1287173550-30365-4-git-send-email-ncrao@google.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2010-10-18 20:52:18 +02:00