Both homeless OSD sessions and watch/notify v2, introduced in later
commits, require periodic ticks which don't depend on ->num_requests.
Schedule the initial tick from ceph_osdc_init() and reschedule from
handle_timeout() unconditionally.
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
ceph_osdc_stop() isn't called if ceph_osdc_init() fails, so we end up
with handle_osds_timeout() running on invalid memory if any one of the
allocations fails. Call schedule_delayed_work() after everything is
setup, just before returning.
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
If you specify ACK | ONDISK and set ->r_unsafe_callback, both
->r_callback and ->r_unsafe_callback(true) are called on ack. This is
very confusing. Redo this so that only one of them is called:
->r_unsafe_callback(true), on ack
->r_unsafe_callback(false), on commit
or
->r_callback, on ack|commit
Decode everything in decode_MOSDOpReply() to reduce clutter.
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
finish_read(), its only user, uses it to get to hdr.data_len, which is
what ->r_result is set to on success. This gains us the ability to
safely call callbacks from contexts other than reply, e.g. map check.
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
The crux of this is getting rid of ceph_osdc_build_request(), so that
MOSDOp can be encoded not before but after calc_target() calculates the
actual target. Encoding now happens within ceph_osdc_start_request().
Also nuked is the accompanying bunch of pointers into the encoded
buffer that was used to update fields on each send - instead, the
entire front is re-encoded. If we want to support target->name_len !=
base->name_len in the future, there is no other way, because oid is
surrounded by other fields in the encoded buffer.
Encoding OSD ops and adding data items to the request message were
mixed together in osd_req_encode_op(). While we want to re-encode OSD
ops, we don't want to add duplicate data items to the message when
resending, so all call to ceph_osdc_msg_data_add() are factored out
into a new setup_request_data().
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
Replace __calc_request_pg() and most of __map_request() with
calc_target() and start using req->r_t.
ceph_osdc_build_request() however still encodes base_oid, because it's
called before calc_target() is and target_oid is empty at that point in
time; a printf in osdc_show() also shows base_oid. This is fixed in
"libceph: switch to calc_target(), part 2".
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
Introduce ceph_osd_request_target, containing all mapping-related
fields of ceph_osd_request and calc_target() for calculating mappings
and populating it.
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
Add and decode pi->min_size and pi->last_force_request_resend. These
are going to be used by calc_target().
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
calc_target() code is going to need to know how to compare PGs. Take
lhs and rhs pgid by const * while at it.
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
Rename ceph_calc_pg_primary() to ceph_pg_to_acting_primary() to
emphasise that it returns acting primary.
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
Knowning just acting set isn't enough, we need to be able to record up
set as well to detect interval changes. This means returning (up[],
up_len, up_primary, acting[], acting_len, acting_primary) and passing
it around. Introduce and switch to ceph_osds to help with that.
Rename ceph_calc_pg_acting() to ceph_pg_to_up_acting_osds() and return
both up and acting sets from it.
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
Rename ceph_oloc_oid_to_pg() to ceph_object_locator_to_pg(). Emphasise
that returned is raw PG and return -ENOENT instead of -EIO if the pool
doesn't exist.
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
eversion_t is version+epoch in userspace and is encoded in that order.
ceph_eversion is defined as epoch+version in rados.h, yet we memcpy it
in __send_request(). Reoder ceph_eversion fields.
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
Given
struct foo {
u64 id;
struct rb_node bar_node;
};
generate insert_bar(), erase_bar() and lookup_bar() functions with
DEFINE_RB_FUNCS(bar, struct foo, id, bar_node)
The key is assumed to be an integer (u64, int, etc), compared with
< and >. nodefld has to be initialized with RB_CLEAR_NODE().
Start using it for MDS, MON and OSD requests and OSD sessions.
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
Switch to ceph_object_id and use ceph_oid_aprintf() instead of a bare
const char *. This reduces noise in rbd_dev_header_name().
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
Currently ceph_object_id can hold object names of up to 100
(CEPH_MAX_OID_NAME_LEN) characters. This is enough for all use cases,
expect one - long rbd image names:
- a format 1 header is named "<imgname>.rbd"
- an object that points to a format 2 header is named "rbd_id.<imgname>"
We operate on these potentially long-named objects during rbd map, and,
for format 1 images, during header refresh. (A format 2 header name is
a small system-generated string.)
Lift this 100 character limit by making ceph_object_id be able to point
to an externally-allocated string. Apart from being able to work with
almost arbitrarily-long named objects, this allows us to reduce the
size of ceph_object_id from >100 bytes to 64 bytes.
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
For a message pool message, preallocate a page, just like we do for
osd_op. For a normal message, take ceph_object_id into account and
don't bother subtracting CEPH_OSD_SLAB_OPS ceph_osd_ops.
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
The size of ->r_request and ->r_reply messages depends on the size of
the object name (ceph_object_id), while the size of ceph_osd_request is
fixed. Move message allocation into a separate function that would
have to be called after ceph_object_id and ceph_object_locator (which
is also going to become variable in size with RADOS namespaces) have
been filled in:
req = ceph_osdc_alloc_request(...);
<fill in req->r_base_oid>
<fill in req->r_base_oloc>
ceph_osdc_alloc_messages(req);
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
ceph_osdc_build_request() is going away. Grab snapc and initialize
->r_snapid in ceph_osdc_alloc_request().
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
By the time we get to checking for_each_obj_request_safe(img_request)
terminating condition, all obj_requests may be complete and img_request
ref, that rbd_img_request_submit() takes away from its caller, may be
put. Moving the next_obj_request cursor is then a use-after-free on
img_request.
It's totally benign, as the value that's read is never used, but
I think it's still worth fixing.
Cc: Alex Elder <elder@linaro.org>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
- Stable-candidate cpuidle fix to make it check the right variable
when deciding whether or not to enable interrupts on the local CPU
so as to avoid enabling iterrupts too early in some cases if the
system has both coupled and per-core idle states (Daniel Lezcano).
- Stable-candidate PM core fix to make it handle failures at the
"late suspend" stage of device suspend consistently for all
devices regardless of whether or not async suspend/resume is
enabled for them (Rafael Wysocki).
- Cleanups in the cpufreq core, the schedutil governor and the
intel_pstate driver (Rafael Wysocki, Pankaj Gupta, Viresh Kumar).
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v2.0.22 (GNU/Linux)
iQIcBAABCAAGBQJXRglcAAoJEILEb/54YlRxCwEQAKQNLO5wrNnOQmVrMHaOk1XH
4zT+AdggLRVgBld4d4Vcli2zJPZfpnZuzjJdwTB5zLgQ4WIcBb6meOfH87XGqRyJ
o4ksyEUhpvDk8AdlmTA5CvvjFuydPJG5ZUSiM035XRT9heebvhgyaMBnT3ucXbq9
7LhNhCQ+a8arndt9ePO7tZnFfQQUbwNJ2BDVuH5DJBqMIFOo2/Kpag43CdFWlWZT
jnWaleDCjSmanuJ/45bFJHJeSZ7PK2etnArfzKtb9QLSGnuEfFPdHuUzJYo5dkP7
UBeYA94hhfR3f5FJIqNlF3N+eLEX1idpwxc8+CJLLDKDd1ZCBrbLoz5fwM+fVn0h
AfmyR+J1czcbiphsmpViOYDRrKdiQVkbP6SpBswvgMCZAcNDF2bxhzOlcuTUc+u0
8xsjWOtArL6uvzsAHa1HY6hhgUn9FB8m20HX+DmS2/zzqyzoRefenoyVcuLsAhXC
fm+sARQ7tvy3OoGRQ9mloWgv2X5iQUY5IVjOG2amIhbUvVmKQutPjVTGTwHqmmcb
2nNYptLsTA6crvnexPcPHY+OFjkQl/omtfaMx+OJl63yhln5ibveGOfZ6F8sPdoB
bRqHuHoK/xh9hSNwj117ZFzq1nm54mLjh0Yhw3EXFcV4I9vdsTp8/WeNThGvT17j
M+6PDXyjlwh3HZpGm+HW
=3vtL
-----END PGP SIGNATURE-----
Merge tag 'pm-4.7-rc1-more' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm
Pull more power management updates from Rafael Wysocki:
"These are two stable-candidate fixes (PM core, cpuidle) and a bunch of
cpufreq cleanups.
Specifics:
- Stable-candidate cpuidle fix to make it check the right variable
when deciding whether or not to enable interrupts on the local CPU
so as to avoid enabling iterrupts too early in some cases if the
system has both coupled and per-core idle states (Daniel Lezcano).
- Stable-candidate PM core fix to make it handle failures at the
"late suspend" stage of device suspend consistently for all devices
regardless of whether or not async suspend/resume is enabled for
them (Rafael Wysocki).
- Cleanups in the cpufreq core, the schedutil governor and the
intel_pstate driver (Rafael Wysocki, Pankaj Gupta, Viresh Kumar)"
* tag 'pm-4.7-rc1-more' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm:
PM / sleep: Handle failures in device_suspend_late() consistently
cpufreq: schedutil: Improve prints messages with pr_fmt
cpuidle: Fix cpuidle_state_is_coupled() argument in cpuidle_enter()
cpufreq: simplified goto out in cpufreq_register_driver()
cpufreq: governor: CPUFREQ_GOV_STOP never fails
cpufreq: governor: CPUFREQ_GOV_POLICY_EXIT never fails
intel_pstate: Simplify conditional in intel_pstate_set_policy()
the only case when we should skip the iterate_and_advance() guts
is when nothing's left in the iterator, _not_ just when requested
amount is 0. Said guts will do nothing in the latter case anyway;
the problem we tried to deal with in the aforementioned commit is
that when there's nothing left *and* the amount requested is 0,
we might end up deferencing one iovec too many; the value we fetch
from there is discarded in that case, but theoretically it might
oops if the iovec array ends exactly at the end of page with the
next page not mapped.
Bailing out on zero size requested had an unexpected side effect -
zero-length segment in the beginning of iovec array ended up
throwing do_loop_readv_writev() into infinite spin; we do not
advance past the empty segment at all. Reproducer is trivial:
echo '#include <sys/uio.h>' >a.c
echo 'main() {char c; struct iovec v[] = {{&c,0},{&c,1}}; readv(0,v,2);}' >>a.c
cc a.c && ./a.out </proc/uptime
which should end up with the process not hanging. Probably ought to
go into LTP or xfstests...
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Commit 98e9cb57 improved the xattr name checks in xattr_resolve_name but
didn't update the NULL attribute name check appropriately, so NULL
attribute names lead to NULL pointer dereferences. Turn that into
-EINVAL results instead.
Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
fs/xattr.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
We usually call btrfs_put_bbio() when btrfs_map_block() failed,
btrfs_put_bbio() works right whether bbio is a valid value, or NULL.
But there is a exception, in some case, btrfs_map_block() will return
fail without touching *bbio(keeping its original value), and if bbio
was not initialized yet, invalid memory accessing will happened.
Above case is in scrub_missing_raid56_pages(), and similar case in
scrub_raid56_parity().
Signed-off-by: Zhao Lei <zhaolei@cn.fujitsu.com>
Signed-off-by: David Sterba <dsterba@suse.com>
The team_device_event() notifier calls team_compute_features() to fix
vlan_features under team->lock to protect team->port_list. The problem is
that subsequent __team_compute_features() calls netdev_change_features()
to propagate vlan_features to upper vlan devices while team->lock is still
taken. This can lead to deadlock when NETIF_F_LRO is modified on lower
devices or team device itself.
Example:
The team0 as active backup with eth0 and eth1 NICs. Both eth0 & eth1 are
LRO capable and LRO is enabled. Thus LRO is also enabled on team0.
The command 'ethtool -K team0 lro off' now hangs due to this deadlock:
dev_ethtool()
-> ethtool_set_features()
-> __netdev_update_features(team)
-> netdev_sync_lower_features()
-> netdev_update_features(lower_1)
-> __netdev_update_features(lower_1)
-> netdev_features_change(lower_1)
-> call_netdevice_notifiers(...)
-> team_device_event(lower_1)
-> team_compute_features(team) [TAKES team->lock]
-> netdev_change_features(team)
-> __netdev_update_features(team)
-> netdev_sync_lower_features()
-> netdev_update_features(lower_2)
-> __netdev_update_features(lower_2)
-> netdev_features_change(lower_2)
-> call_netdevice_notifiers(...)
-> team_device_event(lower_2)
-> team_compute_features(team) [DEADLOCK]
The bug is present in team from the beginning but it appeared after the commit
fd867d5 (net/core: generic support for disabling netdev features down stack)
that adds synchronization of features with lower devices.
Fixes: fd867d5 (net/core: generic support for disabling netdev features down stack)
Cc: Jiri Pirko <jiri@resnulli.us>
Signed-off-by: Ivan Vecera <ivecera@redhat.com>
Signed-off-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
On cheetahplus chips we take the ctx_alloc_lock in order to
modify the TLB lookup parameters for the indexed TLBs, which
are stored in the context register.
This is called with interrupts disabled, however ctx_alloc_lock
is an IRQ safe lock, therefore we must take acquire/release it
properly with spin_{lock,unlock}_irq().
Reported-by: Meelis Roos <mroos@linux.ee>
Tested-by: Meelis Roos <mroos@linux.ee>
Signed-off-by: David S. Miller <davem@davemloft.net>
Florian Fainelli says:
====================
Documentation: dsa: misc fixes
Here are some miscelaneous documentation fixes for DSA, I targeted "net"
because these are not functional code changes, but still documentation fixes
per-se.
Changes in v2:
- reword what the port_vlan_filtering is about based on feedback from Vivien and Ido
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
Described what the port_vlan_filtering function is supposed to
accomplish.
Fixes: fb2dabad69 ("net: dsa: support VLAN filtering switchdev attr")
Signed-off-by: Florian Fainelli <f.fainelli@gmail.com>
Reviewed-by: Vivien Didelot <vivien.didelot@savoirfairelinux.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
We no longer have a priv_size structure member since 5feebd0a8a ("net:
dsa: Remove allocation of driver private memory")
Reviewed-by: Vivien Didelot <vivien.didelot@savoirfairelinux.com>
Signed-off-by: Florian Fainelli <f.fainelli@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
This function has been removed in 4baee937b8 ("net: dsa: remove DSA
link polling") in favor of using the PHYLIB polling mechanism.
Reviewed-by: Vivien Didelot <vivien.didelot@savoirfairelinux.com>
Signed-off-by: Florian Fainelli <f.fainelli@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Otherwise, if we fail to allocate new PIO buffers, our TXQs will try to
use the old ones, which aren't there any more.
Fixes: 183233bec8 "sfc: Allocate and link PIO buffers; map them with write-combining"
Signed-off-by: Edward Cree <ecree@solarflare.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
In IB networks, and specifically in IPoIB/rdmacm traffic, the device
address of an IPoIB interface is used as a means to exchange information
between nodes needed for communication.
Currently an IPoIB interface will always be created with a device
address based on its node GUID without a way to change that.
This change adds the ability to set the device address of an IPoIB
interface by value. We use the set mac address ndo to do that.
The flow should be broken down to two:
1) The GID value is already in the GID table,
in this case the interface will be able to set carrier up.
2) The GID value is not yet in the GID table,
in this case the interface won't try to join the multicast group
and will wait (listen on GID_CHANGE event) until the GID is inserted.
In order to track those changes, we add a new flag:
* IPOIB_FLAG_DEV_ADDR_SET.
When set, it means the dev_addr is a based on a value in the gid
table. this bit will be cleared upon a dev_addr change triggered
by the user and set after validation.
Per IB spec the port GUID can't change if the module is loaded.
port GUID is the basis for GID at index 0 which is the basis for
the default device address of a ipoib interface.
The issue is that there are devices that don't follow the spec,
they change the port GUID while HCA is powered on, so in order
not to break userspace applications. We need to check if the
user wanted to control the device address and we assume that
if he sets the device address back to be based on GID index 0,
he no longer wishs to control it.
In order to track this, we add an additional flag:
* IPOIB_FLAG_DEV_ADDR_CTRL
When setting the device address, there is no validation of the upper
twelve bytes of the device address (flags, qpn, subnet prefix) as those
bytes are not under the control of the user.
Signed-off-by: Mark Bloch <markb@mellanox.com>
Reviewed-by: Leon Romanovsky <leonro@mellanox.com>
Signed-off-by: Leon Romanovsky <leon@kernel.org>
Signed-off-by: Doug Ledford <dledford@redhat.com>
Check (via an SA query) if the SM supports the new option for SendOnly
multicast joins.
If the SM supports that option it will use the new join state to create
such multicast group.
If SendOnlyFullMember is supported, we wouldn't use faked FullMember state
join for SendOnly MCG, use the correct state if supported.
This check is performed at every invocation of mcast_restart task, to be
sure that the driver stays in sync with the current state of the SM.
Signed-off-by: Erez Shitrit <erezsh@mellanox.com>
Reviewed-by: Leon Romanovsky <leonro@mellanox.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
There are four types for MCG, FullMember, NonMember, SendOnlyNonMember,
and the new added type: SendOnlyFullMember.
Add support for the new SendOnlyFullMember join state.
The new type allows host to send join request as sendonly, it will cause the
group to be created but without getting packets from this multicast back to the
host.
Signed-off-by: Erez Shitrit <erezsh@mellanox.com>
Reviewed-by: Leon Romanovsky <leonro@mellanox.com>
Reviewed-by: Christoph Lameter <cl@linux.com>
Reviewed-by: Ira Weiny <ira.weiny@intel.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
New SA query function to return the ClassPortInfo struct from the SA.
If the SM supports FullMemberSendOnly mode for MCG's, it sets a
capability bit in the capability_mask2 field of the response.
Signed-off-by: Erez Shitrit <erezsh@mellanox.com>
Reviewed-by: Leon Romanovsky <leonro@mellanox.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
Change struct ib_class_port_info to conform to IB Spec 1.3
That in order to get specific capability mask from ClassPortInfo mad.
>From the IB Spec, ClassPortInfo section:
"CapabilityMask2 Bits 0-26: Additional class-specific capabilities...
RespTimeValue the rest 5 bits"
The new struct now has one field for capabilitymask2 (previously was the
reserved field) and the resp_time field.
And it fixes up qib and srpt, use of the field repurposed to be used as
capabilitymask2:
IB/qib: Change pma_get_classportinfo
IB/srpt: Adjust the use of ib_class_port_info
Signed-off-by: Erez Shitrit <erezsh@mellanox.com>
Reviewed-by: Leon Romanovsky <leonro@mellanox.com>
Reviewed-by: Hal Rosenstock <hal@mellanox.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
Gregory CLEMENT says:
====================
Fix spinlock usage in HWBM
these two patches fix spinlock related issues introduced in v4.6. They
have been reported by Russell King and Jean-Jacques Hiblot.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
When hwbm_pool_add exited in error the spinlock was not released. This
patch fixes this issue.
Fixes: 8cb2d8bf57 ("net: add a hardware buffer management helper API")
Reported-by: Jean-Jacques Hiblot <jjhiblot@traphandler.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Gregory CLEMENT <gregory.clement@free-electrons.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
The spinlock used by the hwbm functions must be initialized by the
network driver. This commit fixes this lack and the following erros when
lockdep is enabled:
INFO: trying to register non-static key.
the code is fine but needs lockdep annotation.
turning off the locking correctness validator.
[<c010ff80>] (unwind_backtrace) from [<c010bd08>] (show_stack+0x10/0x14)
[<c010bd08>] (show_stack) from [<c032913c>] (dump_stack+0xb4/0xe0)
[<c032913c>] (dump_stack) from [<c01670e4>] (__lock_acquire+0x1f58/0x2060)
[<c01670e4>] (__lock_acquire) from [<c0167dec>] (lock_acquire+0xa4/0xd0)
[<c0167dec>] (lock_acquire) from [<c06f6650>] (_raw_spin_lock_irqsave+0x54/0x68)
[<c06f6650>] (_raw_spin_lock_irqsave) from [<c058e830>] (hwbm_pool_add+0x1c/0xdc)
[<c058e830>] (hwbm_pool_add) from [<c043f4e8>] (mvneta_bm_pool_use+0x338/0x490)
[<c043f4e8>] (mvneta_bm_pool_use) from [<c0443198>] (mvneta_probe+0x654/0x1284)
[<c0443198>] (mvneta_probe) from [<c03b894c>] (platform_drv_probe+0x4c/0xb0)
[<c03b894c>] (platform_drv_probe) from [<c03b7158>] (driver_probe_device+0x214/0x2c0)
[<c03b7158>] (driver_probe_device) from [<c03b72c4>] (__driver_attach+0xc0/0xc4)
[<c03b72c4>] (__driver_attach) from [<c03b5440>] (bus_for_each_dev+0x68/0x9c)
[<c03b5440>] (bus_for_each_dev) from [<c03b65b8>] (bus_add_driver+0x1a0/0x218)
[<c03b65b8>] (bus_add_driver) from [<c03b79cc>] (driver_register+0x78/0xf8)
[<c03b79cc>] (driver_register) from [<c01018f4>] (do_one_initcall+0x90/0x1dc)
[<c01018f4>] (do_one_initcall) from [<c0900de4>] (kernel_init_freeable+0x15c/0x1fc)
[<c0900de4>] (kernel_init_freeable) from [<c06eed90>] (kernel_init+0x8/0x114)
[<c06eed90>] (kernel_init) from [<c0107910>] (ret_from_fork+0x14/0x24)
Fixes: baa11ebc0c ("net: mvneta: Use the new hwbm framework")
Reported-by: Russell King <rmk+kernel@armlinux.org.uk>
Cc: <stable@vger.kernel.org>
Signed-off-by: Gregory CLEMENT <gregory.clement@free-electrons.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Before calling the nla_parse_nested function, make sure the pointer to the
attribute is not null. This patch fixes several potential null pointer
dereference vulnerabilities in the tipc netlink functions.
Signed-off-by: Baozeng Ding <sploving1@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>