We were seeing faults in the solos-pci receive tasklet when packets
arrived for a VCC which was currently being closed:
[18842.727906] EIP: [<e082f490>] br2684_push+0x19/0x234 [br2684] SS:ESP 0068:dfb89d14
[18845.090712] [<c13ecff3>] ? do_page_fault+0x0/0x2e1
[18845.120042] [<e082f490>] ? br2684_push+0x19/0x234 [br2684]
[18845.153530] [<e084fa13>] solos_bh+0x28b/0x7c8 [solos_pci]
[18845.186488] [<e084f711>] ? solos_irq+0x2d/0x51 [solos_pci]
[18845.219960] [<c100387b>] ? handle_irq+0x3b/0x48
[18845.247732] [<c10265cb>] ? irq_exit+0x34/0x57
[18845.274437] [<c1025720>] tasklet_action+0x42/0x69
[18845.303247] [<c102643f>] __do_softirq+0x8e/0x129
[18845.331540] [<c10264ff>] do_softirq+0x25/0x2a
[18845.358274] [<c102664c>] _local_bh_enable_ip+0x5e/0x6a
[18845.389677] [<c102666d>] local_bh_enable+0xb/0xe
[18845.417944] [<e08490a8>] ppp_unregister_channel+0x32/0xbb [ppp_generic]
[18845.458193] [<e08731ad>] pppox_unbind_sock+0x18/0x1f [pppox]
This patch uses an RCU-inspired approach to fix it. In the RX tasklet's
find_vcc() function we first refuse to use a VCC which already has the
ATM_VF_READY bit cleared. And in the VCC close function, we synchronise
with the tasklet to ensure that it can't still be using the VCC before
we continue and allow the VCC to be destroyed.
Signed-off-by: David Woodhouse <David.Woodhouse@intel.com>
Tested-by: Nathan Williams <nathan@traverse.com.au>
Cc: stable@kernel.org
Signed-off-by: David S. Miller <davem@davemloft.net>
Since there was added ->tcf_chain() method without ->bind_tcf() to
sch_sfq class options, there is oops when a filter is added with
the classid parameter.
Fixes commit 7d2681a6ff
netdev thread: null pointer at cls_api.c
Signed-off-by: Jarek Poplawski <jarkao2@gmail.com>
Reported-by: Franchoze Eric <franchoze@yandex.ru>
Signed-off-by: David S. Miller <davem@davemloft.net>
Although netif_rx() isn't expected to be called in process context with
preemption enabled, it'd better handle this case. And this is why get_cpu()
is used in the non-RPS #ifdef branch. If tree RCU is selected,
rcu_read_lock() won't disable preemption, so preempt_disable() should be
called explictly.
Signed-off-by: Changli Gao <xiaosuo@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
tcp_parse_md5sig_option doesn't check md5sig option (TCPOPT_MD5SIG)
length, but tcp_v[46]_inbound_md5_hash assume that it's at least 16
bytes long.
Signed-off-by: Dmitry Popov <dp@highloadlab.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/ieee1394/linux1394-2.6: (82 commits)
firewire: core: add forgotten dummy driver methods, remove unused ones
firewire: add isochronous multichannel reception
firewire: core: small clarifications in core-cdev
firewire: core: remove unused code
firewire: ohci: release channel in error path
firewire: ohci: use memory barriers to order descriptor updates
tools/firewire: nosy-dump: increment program version
tools/firewire: nosy-dump: remove unused code
tools/firewire: nosy-dump: use linux/firewire-constants.h
tools/firewire: nosy-dump: break up a deeply nested function
tools/firewire: nosy-dump: make some symbols static or const
tools/firewire: nosy-dump: change to kernel coding style
tools/firewire: nosy-dump: work around segfault in decode_fcp
tools/firewire: nosy-dump: fix it on x86-64
tools/firewire: add userspace front-end of nosy
firewire: nosy: note ioctls in ioctl-number.txt
firewire: nosy: use generic printk macros
firewire: nosy: endianess fixes and annotations
firewire: nosy: annotate __user pointers and __iomem pointers
firewire: nosy: fix device shutdown with active client
...
* 'acpica' of git://git.kernel.org/pub/scm/linux/kernel/git/lenb/linux-acpi-2.6: (27 commits)
ACPI / ACPICA: Simplify acpi_ev_initialize_gpe_block()
ACPI / ACPICA: Fail acpi_gpe_wakeup() if ACPI_GPE_CAN_WAKE is unset
ACPI / ACPICA: Do not execute _PRW methods during initialization
ACPI: Fix bogus GPE test in acpi_bus_set_run_wake_flags()
ACPICA: Update version to 20100702
ACPICA: Fix for Alias references within Package objects
ACPICA: Fix lint warning for 64-bit constant
ACPICA: Remove obsolete GPE function
ACPICA: Update debug output components
ACPICA: Add support for WDDT - Watchdog Descriptor Table
ACPICA: Drop acpi_set_gpe
ACPICA: Use low-level GPE enable during GPE block initialization
ACPI / EC: Do not use acpi_set_gpe
ACPI / EC: Drop suspend and resume routines
ACPICA: Remove wakeup GPE reference counting which is not used
ACPICA: Introduce acpi_gpe_wakeup()
ACPICA: Rename acpi_hw_gpe_register_bit
ACPICA: Update version to 20100528
ACPICA: Add signatures for undefined tables: ATKG, GSCI, IEIT
ACPICA: Optimization: Reduce the number of namespace walks
...
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/roland/infiniband: (42 commits)
IB/qib: Add missing <linux/slab.h> include
IB/ehca: Drop unnecessary NULL test
RDMA/nes: Fix confusing if statement indentation
IB/ehca: Init irq tasklet before irq can happen
RDMA/nes: Fix misindented code
RDMA/nes: Fix showing wqm_quanta
RDMA/nes: Get rid of "set but not used" variables
RDMA/nes: Read firmware version from correct place
IB/srp: Export req_lim via sysfs
IB/srp: Make receive buffer handling more robust
IB/srp: Use print_hex_dump()
IB: Rename RAW_ETY to RAW_ETHERTYPE
RDMA/nes: Fix two sparse warnings
RDMA/cxgb3: Make needlessly global iwch_l2t_send() static
IB/iser: Make needlessly global iser_alloc_rx_descriptors() static
RDMA/cxgb4: Add timeouts when waiting for FW responses
IB/qib: Fix race between qib_error_qp() and receive packet processing
IB/qib: Limit the number of packets processed per interrupt
IB/qib: Allow writes to the diag_counters to be able to clear them
IB/qib: Set cfgctxts to number of CPUs by default
...
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tiwai/sound-2.6: (214 commits)
ALSA: hda - Add pin-fix for HP dc5750
ALSA: als4000: Fix potentially invalid DMA mode setup
ALSA: als4000: enable burst mode
ALSA: hda - Fix initial capsrc selection in patch_alc269()
ASoC: TWL4030: Capture route runtime DAPM ordering fix
ALSA: hda - Add PC-beep whitelist for an Intel board
ALSA: hda - More relax for pending period handling
ALSA: hda - Define AC_FMT_* constants
ALSA: hda - Fix beep frequency on IDT 92HD73xx and 92HD71Bxx codecs
ALSA: hda - Add support for HDMI HBR passthrough
ALSA: hda - Set Stream Type in Stream Format according to AES0
ALSA: hda - Fix Thinkpad X300 so SPDIF is not exposed
ALSA: hda - FIX to not expose SPDIF on Thinkpad X301, since it does not have the ability to use SPDIF
ASoC: wm9081: fix resource reclaim in wm9081_register error path
ASoC: wm8978: fix a memory leak if a wm8978_register fail
ASoC: wm8974: fix a memory leak if another WM8974 is registered
ASoC: wm8961: fix resource reclaim in wm8961_register error path
ASoC: wm8955: fix resource reclaim in wm8955_register error path
ASoC: wm8940: fix a memory leak if wm8940_register return error
ASoC: wm8904: fix resource reclaim in wm8904_register error path
...
* 'bkl/core' of git://git.kernel.org/pub/scm/linux/kernel/git/frederic/random-tracing:
do_coredump: Do not take BKL
init: Remove the BKL from startup code
* 'for-2.6.36' of git://linux-nfs.org/~bfields/linux: (34 commits)
nfsd4: fix file open accounting for RDWR opens
nfsd: don't allow setting maxblksize after svc created
nfsd: initialize nfsd versions before creating svc
net: sunrpc: removed duplicated #include
nfsd41: Fix a crash when a callback is retried
nfsd: fix startup/shutdown order bug
nfsd: minor nfsd read api cleanup
gcc-4.6: nfsd: fix initialized but not read warnings
nfsd4: share file descriptors between stateid's
nfsd4: fix openmode checking on IO using lock stateid
nfsd4: miscellaneous process_open2 cleanup
nfsd4: don't pretend to support write delegations
nfsd: bypass readahead cache when have struct file
nfsd: minor nfsd_svc() cleanup
nfsd: move more into nfsd_startup()
nfsd: just keep single lockd reference for nfsd
nfsd: clean up nfsd_create_serv error handling
nfsd: fix error handling in __write_ports_addxprt
nfsd: fix error handling when starting nfsd with rpcbind down
nfsd4: fix v4 state shutdown error paths
...
Fix the module init error handling. There are a bunch of goto labels for
aborting the init procedure at different points and just undoing what needs
undoing - they aren't all in the right places, however.
This can lead to an oops like the following:
BUG: unable to handle kernel NULL pointer dereference at 0000000000000020
IP: [<ffffffff81042a31>] destroy_workqueue+0x17/0xc0
...
Modules linked in: kafs(+) dns_resolver rxkad af_rxrpc fscache
Pid: 2171, comm: insmod Not tainted 2.6.35-cachefs+ #319 DG965RY/
...
Process insmod (pid: 2171, threadinfo ffff88003ca6a000, task ffff88003dcc3050)
...
Call Trace:
[<ffffffffa0055994>] afs_callback_update_kill+0x10/0x12 [kafs]
[<ffffffffa007d1c5>] afs_init+0x190/0x1ce [kafs]
[<ffffffffa007d035>] ? afs_init+0x0/0x1ce [kafs]
[<ffffffff810001ef>] do_one_initcall+0x59/0x14e
[<ffffffff8105f7ee>] sys_init_module+0x9c/0x1de
[<ffffffff81001eab>] system_call_fastpath+0x16/0x1b
Signed-off-by: David Howells <dhowells@redhat.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* 'nfs-for-2.6.36' of git://git.linux-nfs.org/projects/trondmy/nfs-2.6: (42 commits)
NFS: NFSv4.1 is no longer a "developer only" feature
NFS: NFS_V4 is no longer an EXPERIMENTAL feature
NFS: Fix /proc/mount for legacy binary interface
NFS: Fix the locking in nfs4_callback_getattr
SUNRPC: Defer deleting the security context until gss_do_free_ctx()
SUNRPC: prevent task_cleanup running on freed xprt
SUNRPC: Reduce asynchronous RPC task stack usage
SUNRPC: Move the bound cred to struct rpc_rqst
SUNRPC: Clean up of rpc_bindcred()
SUNRPC: Move remaining RPC client related task initialisation into clnt.c
SUNRPC: Ensure that rpc_exit() always wakes up a sleeping task
SUNRPC: Make the credential cache hashtable size configurable
SUNRPC: Store the hashtable size in struct rpc_cred_cache
NFS: Ensure the AUTH_UNIX credcache is allocated dynamically
NFS: Fix the NFS users of rpc_restart_call()
SUNRPC: The function rpc_restart_call() should return success/failure
NFSv4: Get rid of the bogus RPC_ASSASSINATED(task) checks
NFSv4: Clean up the process of renewing the NFSv4 lease
NFSv4.1: Handle NFS4ERR_DELAY on SEQUENCE correctly
NFS: nfs_rename() should not have to flush out writebacks
...
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/ryusuke/nilfs2: (45 commits)
nilfs2: reject filesystem with unsupported block size
nilfs2: avoid rec_len overflow with 64KB block size
nilfs2: simplify nilfs_get_page function
nilfs2: reject incompatible filesystem
nilfs2: add feature set fields to super block
nilfs2: clarify byte offset in super block format
nilfs2: apply read-ahead for nilfs_btree_lookup_contig
nilfs2: introduce check flag to btree node buffer
nilfs2: add btree get block function with readahead option
nilfs2: add read ahead mode to nilfs_btnode_submit_block
nilfs2: fix buffer head leak in nilfs_btnode_submit_block
nilfs2: eliminate inline keywords in btree implementation
nilfs2: get maximum number of child nodes from bmap object
nilfs2: reduce repetitive calculation of max number of child nodes
nilfs2: optimize calculation of min/max number of btree node children
nilfs2: remove redundant pointer checks in bmap lookup functions
nilfs2: get rid of nilfs_bmap_union
nilfs2: unify bmap set_target_v operations
nilfs2: get rid of nilfs_btree uses
nilfs2: get rid of nilfs_direct uses
...
* 'next' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4: (40 commits)
ext4: Adding error check after calling ext4_mb_regular_allocator()
ext4: Fix dirtying of journalled buffers in data=journal mode
ext4: re-inline ext4_rec_len_(to|from)_disk functions
jbd2: Remove t_handle_lock from start_this_handle()
jbd2: Change j_state_lock to be a rwlock_t
jbd2: Use atomic variables to avoid taking t_handle_lock in jbd2_journal_stop
ext4: Add mount options in superblock
ext4: force block allocation on quota_off
ext4: fix freeze deadlock under IO
ext4: drop inode from orphan list if ext4_delete_inode() fails
ext4: check to make make sure bd_dev is set before dereferencing it
jbd2: Make barrier messages less scary
ext4: don't print scary messages for allocation failures post-abort
ext4: fix EFBIG edge case when writing to large non-extent file
ext4: fix ext4_get_blocks references
ext4: Always journal quota file modifications
ext4: Fix potential memory leak in ext4_fill_super
ext4: Don't error out the fs if the user tries to make a file too big
ext4: allocate stripe-multiple IOs on stripe boundaries
ext4: move aio completion after unwritten extent conversion
...
Fix up conflicts in fs/ext4/inode.c as per Ted.
Fix up xfs conflicts as per earlier xfs merge.
* 'for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jack/linux-fs-2.6:
ext3: Fix dirtying of journalled buffers in data=journal mode
ext3: default to ordered mode
quota: Use mark_inode_dirty_sync instead of mark_inode_dirty
quota: Change quota error message to print out disk and function name
MAINTAINERS: Update entries of ext2 and ext3
MAINTAINERS: Update address of Andreas Dilger
ext3: Avoid filesystem corruption after a crash under heavy delete load
ext3: remove vestiges of nobh support
ext3: Fix set but unused variables
quota: clean up quota active checks
quota: Clean up the namespace in dqblk_xfs.h
quota: check quota reservation on remove_dquot_ref
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/teigland/dlm:
fs/dlm: Drop unnecessary null test
dlm: use genl_register_family_with_ops()
* 'for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jack/linux-udf-2.6:
udf: super.c Fix warning: variable 'sbi' set but not used
udf: remove duplicated #include
* git://git.kernel.org/pub/scm/linux/kernel/git/sfrench/cifs-2.6:
[DNS RESOLVER] Minor typo correction
DNS: Fixes for the DNS query module
cifs: Include linux/err.h for IS_ERR and PTR_ERR
DNS: Make AFS go to the DNS for AFSDB records for unknown cells
DNS: Separate out CIFS DNS Resolver code
cifs: account for new creduid=0x%x parameter in spnego upcall string
cifs: reduce false positives with inode aliasing serverino autodisable
CIFS: Make cifs_convert_address() take a const src pointer and a length
cifs: show features compiled in as part of DebugData
cifs: update README
Fix up trivial conflicts in fs/cifs/cifsfs.c due to workqueue changes
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/wq: (55 commits)
workqueue: mark init_workqueues() as early_initcall()
workqueue: explain for_each_*cwq_cpu() iterators
fscache: fix build on !CONFIG_SYSCTL
slow-work: kill it
gfs2: use workqueue instead of slow-work
drm: use workqueue instead of slow-work
cifs: use workqueue instead of slow-work
fscache: drop references to slow-work
fscache: convert operation to use workqueue instead of slow-work
fscache: convert object to use workqueue instead of slow-work
workqueue: fix how cpu number is stored in work->data
workqueue: fix mayday_mask handling on UP
workqueue: fix build problem on !CONFIG_SMP
workqueue: fix locking in retry path of maybe_create_worker()
async: use workqueue for worker pool
workqueue: remove WQ_SINGLE_CPU and use WQ_UNBOUND instead
workqueue: implement unbound workqueue
workqueue: prepare for WQ_UNBOUND implementation
libata: take advantage of cmwq and remove concurrency limitations
workqueue: fix worker management invocation without pending works
...
Fixed up conflicts in fs/cifs/* as per Tejun. Other trivial conflicts in
include/linux/workqueue.h, kernel/trace/Kconfig and kernel/workqueue.c
Stephen reports:
After merging the block tree, today's linux-next build (x86_64
allmodconfig) failed like this:
usr/include/linux/fs.h:11: included file 'linux/blk_types.h' is not exported
Caused by commit 9d3dbbcd9a84518ff5e32ffe671d06a48cf84fd9 ("bio, fs:
separate out bio_types.h and define READ/WRITE constants in terms of
BIO_RW_* flags").
Reported-by: Stephen Rothwell <sfr@canb.auug.org.au>
Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
Fix a bug where a lock is _bh nested within another _bh lock,
but forgets to use the _bh variant for unlock.
Further more, it's not necessary to test _bh locks, the inner lock
can just use spin_lock(). So fix up the bug by making that change.
Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
It was a now abandoned attempt to throttle resync bandwidth
based on the delay it causes on the bulk data socket.
It has no userbase yet, and has been disabled by
9173465ccb51c09cc3102a10af93e9f469a0af6f already.
This removes the now unused code.
The basic feature, namely using up "idle" bandwith
of network and disk IO subsystem, with minimal impact
to application IO, is being reimplemented differently.
Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
This patch makes sure we first initialize everything and set the BDI_registered
flag, and only after this we add the bdi to 'bdi_list'. Current code adds the
bdi to the list too early, and as a result I the
WARN(!test_bit(BDI_registered, &bdi->state)
in bdi forker is triggered. Also, it is in general good practice to make things
visible only when they are fully initialized.
Also, this patch does few micro clean-ups:
1. Removes the 'exit' label which does not do anything, just returns. This
allows to get rid of few braces and 'ret' variable and make the code smaller.
2. If 'kthread_run()' fails, remove the error code it returns, not hard-coded
'-ENOMEM'. Theoretically, some day 'kthread_run()' can return something
else. Also, in case of failure it is not necessary to set 'bdi->wb.task' to
NULL.
Signed-off-by: Artem Bityutskiy <Artem.Bityutskiy@nokia.com>
Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
Add 2 new trace points to the periodic write-back wake up case, just like we do
in the 'bdi_queue_work()' function. Namely, introduce:
1. trace_writeback_wake_thread(bdi)
2. trace_writeback_wake_forker_thread(bdi)
The first event is triggered every time we wake up a bdi thread to start
periodic background write-out. The second event is triggered only when the bdi
thread does not exist and should be created by the forker thread.
This patch was suggested by Dave Chinner and Christoph Hellwig.
Signed-off-by: Artem Bityutskiy <Artem.Bityutskiy@nokia.com>
Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
The 'setup_timer()' function also calls 'init_timer()', so the extra
'init_timer()' call is not needed. Indeed, 'setup_timer()' is basically
'init_timer()' plus callback function and data pointers initialization.
Signed-off-by: Artem Bityutskiy <Artem.Bityutskiy@nokia.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
Whe the first inode for a bdi is marked dirty, we wake up the bdi thread which
should take care of the periodic background write-out. However, the write-out
will actually start only 'dirty_writeback_interval' centisecs later, so we can
delay the wake-up.
This change was requested by Nick Piggin who pointed out that if we delay the
wake-up, we weed out 2 unnecessary contex switches, which matters because
'__mark_inode_dirty()' is a hot-path function.
This patch introduces a new function - 'bdi_wakeup_thread_delayed()', which
sets up a timer to wake-up the bdi thread and returns. So the wake-up is
delayed.
We also delete the timer in bdi threads just before writing-back. And
synchronously delete it when unregistering bdi. At the unregister point the bdi
does not have any users, so no one can arm it again.
Since now we take 'bdi->wb_lock' in the timer, which can execute in softirq
context, we have to use 'spin_lock_bh()' for 'bdi->wb_lock'. This patch makes
this change as well.
This patch also moves the 'bdi_wb_init()' function down in the file to avoid
forward-declaration of 'bdi_wakeup_thread_delayed()'.
Signed-off-by: Artem Bityutskiy <Artem.Bityutskiy@nokia.com>
Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
Finally, we can get rid of unnecessary wake-ups in bdi threads, which are very
bad for battery-driven devices.
There are two types of activities bdi threads do:
1. process bdi works from the 'bdi->work_list'
2. periodic write-back
So there are 2 sources of wake-up events for bdi threads:
1. 'bdi_queue_work()' - submits bdi works
2. '__mark_inode_dirty()' - adds dirty I/O to bdi's
The former already has bdi wake-up code. The latter does not, and this patch
adds it.
'__mark_inode_dirty()' is hot-path function, but this patch adds another
'spin_lock(&bdi->wb_lock)' there. However, it is taken only in rare cases when
the bdi has no dirty inodes. So adding this spinlock should be fine and should
not affect performance.
This patch makes sure bdi threads and the forker thread do not wake-up if there
is nothing to do. The forker thread will nevertheless wake up at least every
5 min. to check whether it has to kill a bdi thread. This can also be optimized,
but is not worth it.
This patch also tidies up the warning about unregistered bid, and turns it from
an ugly crocodile to a simple 'WARN()' statement.
Signed-off-by: Artem Bityutskiy <Artem.Bityutskiy@nokia.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
Currently, bdi threads can decide to exit if there were no useful activities
for 5 minutes. However, this causes nasty races: we can easily oops in the
'bdi_queue_work()' if the bdi thread decides to exit while we are waking it up.
And even if we do not oops, but the bdi tread exits immediately after we wake
it up, we'd lose the wake-up event and have an unnecessary delay (up to 5 secs)
in the bdi work processing.
This patch makes the forker thread to be the central place which not only
creates bdi threads, but also kills them if they were inactive long enough.
This better design-wise.
Another reason why this change was done is to prepare for the further changes
which will prevent the bdi threads from waking up every 5 sec and wasting
power. Indeed, when the task does not wake up periodically anymore, it won't be
able to exit either.
This patch also moves the the 'wake_up_bit()' call from the bdi thread to the
forker thread as well. So now the forker thread sets the BDI_pending bit, then
forks the task or kills it, then clears the bit and wakes up the waiting
process.
The only process which may wain on the bit is 'bdi_wb_shutdown()'. This
function was changed as well - now it first removes the bdi from the
'bdi_list', then waits on the 'BDI_pending' bit. Once it wakes up, it is
guaranteed that the forker thread won't race with it, because the bdi is not
visible. Note, the forker thread sets the 'BDI_pending' bit under the
'bdi->wb_lock' which is essential for proper serialization.
And additionally, when we change 'bdi->wb.task', we now take the
'bdi->work_lock', to make sure that we do not lose wake-ups which we otherwise
would when raced with, say, 'bdi_queue_work()'.
Signed-off-by: Artem Bityutskiy <Artem.Bityutskiy@nokia.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
This patch re-structures the bdi forker a little:
1. Add 'bdi_cap_flush_forker(bdi)' condition check to the bdi loop. The reason
for this is that the forker thread can start _before_ the 'BDI_registered'
flag is set (see 'bdi_register()'), so the WARN() statement will fire for
the default bdi. I observed this warning at boot-up.
2. Introduce an enum 'action' and use "switch" statement in the outer loop.
This is a preparation to the further patch which will teach the forker
thread killing bdi threads, so we'll have another case in the "switch"
statement. This change was suggested by Christoph Hellwig.
This patch is just a small step towards the coming change where the forker
thread will kill the bdi threads. It should simplify reviewing the following
changes, which would otherwise be larger.
This patch also amends comments a little.
Signed-off-by: Artem Bityutskiy <Artem.Bityutskiy@nokia.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
Currently bdi threads use local variable 'last_active' which stores last time
when the bdi thread did some useful work. Move this local variable to 'struct
bdi_writeback'. This is just a preparation for the further patches which will
make the forker thread decide when bdi threads should be killed.
Signed-off-by: Artem Bityutskiy <Artem.Bityutskiy@nokia.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
The forker thread removes bdis from 'bdi_list' before forking the bdi thread.
But this is wrong for at least 2 reasons.
Reason #1: if we temporary remove a bdi from the list, we may miss works which
would otherwise be given to us.
Reason #2: this is racy; indeed, 'bdi_wb_shutdown()' expects that bdis are
always in the 'bdi_list' (see 'bdi_remove_from_list()'), and when
it races with the forker thread, it can shut down the bdi thread
at the same time as the forker creates it.
This patch makes sure the forker thread never removes bdis from 'bdi_list'
(which was suggested by Christoph Hellwig).
In order to make sure that we do not race with 'bdi_wb_shutdown()', we have to
hold the 'bdi_lock' while walking the 'bdi_list' and setting the 'BDI_pending'
flag.
NOTE! The error path is interesting. Currently, when we fail to create a bdi
thread, we move the bdi to the tail of 'bdi_list'. But if we never remove the
bdi from the list, we cannot move it to the tail either, because then we can
mess up the RCU readers which walk the list. And also, we'll have the race
described above in "Reason #2".
But I not think that adding to the tail is any important so I just do not do
that.
Signed-off-by: Artem Bityutskiy <Artem.Bityutskiy@nokia.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
This patch simplifies bdi code a little by removing the 'pending_list' which is
redundant. Indeed, currently the forker thread ('bdi_forker_thread()') is
working like this:
1. In a loop, fetch all bdi's which have works but have no writeback thread and
move them to the 'pending_list'.
2. If the list is empty, sleep for 5 sec.
3. Otherwise, take one bdi from the list, fork the writeback thread for this
bdi, and repeat the loop.
IOW, it first moves everything to the 'pending_list', then process only one
element, and so on. This patch simplifies the algorithm, which is now as
follows.
1. Find the first bdi which has a work and remove it from the global list of
bdi's (bdi_list).
2. If there was not such bdi, sleep 5 sec.
3. Fork the writeback thread for this bdi and repeat the loop.
IOW, now we find the first bdi to process, process it, and so on. This is
simpler and involves less lists.
The bonus now is that we can get rid of a couple of functions, as well as
remove complications which involve 'rcu_call()' and 'bdi->rcu_head'.
This patch also makes sure we use 'list_add_tail_rcu()', instead of plain
'list_add_tail()', but this piece of code is going to be removed in the next
patch anyway.
Signed-off-by: Artem Bityutskiy <Artem.Bityutskiy@nokia.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
Currently, bdi threads ('bdi_writeback_thread()') can lose wake-ups. For
example, if 'bdi_queue_work()' is executed after the bdi thread have had
finished 'wb_do_writeback()' but before it called
'schedule_timeout_interruptible()'.
To fix this issue, we have to check whether we have works to process after we
have changed the task state to 'TASK_INTERRUPTIBLE'.
This patch also clean-ups handling of the cases when 'dirty_writeback_interval'
is zero or non-zero.
Additionally, this patch also removes unneeded 'list_empty_careful()' call.
Signed-off-by: Artem Bityutskiy <Artem.Bityutskiy@nokia.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
Currently, if someone submits jobs for the default bdi, we can lose wake-up
events. E.g., this can happen if 'bdi_queue_work()' is called when
'bdi_forker_thread()' is executing code after 'wb_do_writeback(me, 0)', but
before 'set_current_state(TASK_INTERRUPTIBLE)'.
This situation is unlikely, and the result is not very severe - we'll just
delay the execution of the work, but this is still not very nice.
This patch fixes the issue by checking whether the default bdi has works before
the forker thread goes sleep.
Signed-off-by: Artem Bityutskiy <Artem.Bityutskiy@nokia.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
Currently the forker thread can lose wake-ups which may lead to unnecessary
delays in processing bdi works. E.g., consider the following scenario.
1. 'bdi_forker_thread()' walks the 'bdi_list', finds out there is nothing to
do, and is about to finish the loop.
2. A bdi thread decides to exit because it was inactive for long time.
3. 'bdi_queue_work()' adds a work to the bdi which just exited, so it wakes up
the forker thread.
4. but 'bdi_forker_thread()' executes 'set_current_state(TASK_INTERRUPTIBLE)'
and goes sleep. We lose a wake-up.
Losing the wake-up is not fatal, but this means that the bdi work processing
will be delayed by up to 5 sec. This race is theoretical, I never hit it, but
it is worth fixing.
The fix is to execute 'set_current_state(TASK_INTERRUPTIBLE)' _before_ walking
'bdi_list', not after.
Signed-off-by: Artem Bityutskiy <Artem.Bityutskiy@nokia.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
This patch fixes a very unlikely race condition on the bdi forker thread error
path: when bdi thread creation fails, 'bdi->wb.task' may contain the error code
for a short period of time. If at the same time someone submits a work to this
bdi, we can end up with an oops 'bdi_queue_work()' while executing
'wake_up_process(wb->task)'.
This patch fixes the issue by introducing a temporary variable 'task' and
storing the possible error code there, so that 'wb->task' would never take
erroneous values.
Note, this race is very unlikely and I never hit it, so it is theoretical, but
nevertheless worth fixing.
This patch also merges 2 comments which were previously separate.
Signed-off-by: Artem Bityutskiy <Artem.Bityutskiy@nokia.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
The write-back code mixes words "thread" and "task" for the same things. This
is not a big deal, but still an inconsistency.
hch: a convention I tend to use and I've seen in various places
is to always use _task for the storage of the task_struct pointer,
and thread everywhere else. This especially helps with having
foo_thread for the actual thread and foo_task for a global
variable keeping the task_struct pointer
This patch renames:
* 'bdi_add_default_flusher_task()' -> 'bdi_add_default_flusher_thread()'
* 'bdi_forker_task()' -> 'bdi_forker_thread()'
because bdi threads are 'bdi_writeback_thread()', so these names are more
consistent.
This patch also amends commentaries and makes them refer the forker and bdi
threads as "thread", not "task".
Also, while on it, make 'bdi_add_default_flusher_thread()' declaration use
'static void' instead of 'void static' and make checkpatch.pl happy.
Signed-off-by: Artem Bityutskiy <Artem.Bityutskiy@nokia.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
linux/fs.h hard coded READ/WRITE constants which should match BIO_RW_*
flags. This is fragile and caused breakage during BIO_RW_* flag
rearrangement. The hardcoding is to avoid include dependency hell.
Create linux/bio_types.h which contatins definitions for bio data
structures and flags and include it from bio.h and fs.h, and make fs.h
define all READ/WRITE related constants in terms of BIO_RW_* flags.
Signed-off-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
Commit a82afdf (block: use the same failfast bits for bio and request)
moved BIO_RW_* bits around such that they match up with REQ_* bits.
Unfortunately, fs.h hard coded RW_MASK, RWA_MASK, READ, WRITE, READA
and SWRITE as 0, 1, 2 and 3, and expected them to match with BIO_RW_*
bits. READ/WRITE didn't change but BIO_RW_AHEAD was moved to bit 4
instead of bit 1, breaking RWA_MASK, READA and SWRITE.
This patch updates RWA_MASK, READA and SWRITE such that they match the
BIO_RW_* bits again. A follow up patch will update the definitions to
directly use BIO_RW_* bits so that this kind of breakage won't happen
again.
Neil also spotted missing RWA_MASK conversion.
Stable: The offending commit a82afdf was released with v2.6.32, so
this patch should be applied to all kernels since then but it must
_NOT_ be applied to kernels earlier than that.
Signed-off-by: Tejun Heo <tj@kernel.org>
Reported-and-bisected-by: Vladislav Bolkhovitin <vst@vlnb.net>
Root-caused-by: Neil Brown <neilb@suse.de>
Cc: stable@kernel.org
Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
Filesystems can call sb_issue_discard on a memory reclaim path
(e.g. ext4 calls sb_issue_discard during journal commit).
Use GFP_NOFS in sb_issue_discard to avoid recursing back into the FS.
Reported-by: Mikulas Patocka <mpatocka@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
put_user() may fail, if so return -EFAULT.
Signed-off-by: Kulikov Vasiliy <segooon@gmail.com>
Acked-by: Mike Miller <mike.miller@hp.com>
Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
83ba7b07 cleans up the writeback.
So we don't use wb any more in get_next_work_item.
Let's remove unnecessary argument.
CC: Christoph Hellwig <hch@lst.de>
Signed-off-by: Minchan Kim <minchan.kim@gmail.com>
Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
SPLICE_F_NONBLOCK is clearly documented to only affect blocking on the
pipe. In __generic_file_splice_read(), however, it causes an EAGAIN
if the page is currently being read.
This makes it impossible to write an application that only wants
failure if the pipe is full. For example if the same process is
handling both ends of a pipe and isn't otherwise able to determine
whether a splice to the pipe will fill it or not.
We could make the read non-blocking on O_NONBLOCK or some other splice
flag, but for now this is the simplest fix.
Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>
CC: stable@kernel.org
Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
If there's no feature-barrier key in xenstore, then it means its a fairly
old backend which does uncached in-order writes, which means ORDERED_DRAIN
is appropriate.
Signed-off-by: Jeremy Fitzhardinge <jeremy.fitzhardinge@citrix.com>
When barriers are supported, then use QUEUE_ORDERED_TAG to tell the block
subsystem that it doesn't need to do anything else with the barriers.
Previously we used ORDERED_DRAIN which caused the block subsystem to
drain all pending IO before submitting the barrier, which would be
very expensive.
Signed-off-by: Jeremy Fitzhardinge <jeremy.fitzhardinge@citrix.com>