Commit graph

545354 commits

Author SHA1 Message Date
Randy Dunlap
7fadc59cc8 fs: fix fs/locks.c kernel-doc warning
Fix kernel-doc warnings in fs/locks.c:

Warning(..//fs/locks.c:1577): No description found for parameter 'flags'

Signed-off-by: Randy Dunlap <rdunlap@infradead.org>
Signed-off-by: Jeff Layton <jeff.layton@primarydata.com>
2015-08-31 16:27:25 -04:00
J. Bruce Fields
1f65c17e15 nfsd: Add Jeff Layton as co-maintainer
Jeff has been doing a lot of development (including much of the
state-locking rewrite just as one example) plus lots of review and other
miscellaneous nfsd work, so let's acknowledge the status quo.

I'll continue to be the one to send regular pull requests but Jeff
should be available to cover there occasionally too.

Signed-off-by: J. Bruce Fields <bfields@redhat.com>
2015-08-31 16:27:24 -04:00
Kinglong Mee
75976de655 NFSD: Return word2 bitmask if setting security label in OPEN/CREATE
A security label can be set in an OPEN/CREATE request; nfsd should set
the bitmask in word2 if setting the label succeeds.

Signed-off-by: Kinglong Mee <kinglongmee@gmail.com>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
2015-08-31 16:16:40 -04:00
Kinglong Mee
ead8fb8c24 NFSD: Set the attributes used to store the verifier for EXCLUSIVE4_1
According to rfc5661 18.16.4,
"If EXCLUSIVE4_1 was used, the client determines the attributes
 used for the verifier by comparing attrset with cva_attrs.attrmask;"

So EXCLUSIVE4_1 also needs the bitmask of attributes used to store the verifier.

Signed-off-by: Kinglong Mee <kinglongmee@gmail.com>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
2015-08-31 16:16:39 -04:00
Kinglong Mee
7d580722c9 nfsd: SUPPATTR_EXCLCREAT must be encoded before SECURITY_LABEL.
The attributes should be encoded in the order in which the bitmask defines them.

Signed-off-by: Kinglong Mee <kinglongmee@gmail.com>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
2015-08-31 16:16:39 -04:00
Kinglong Mee
6896f15aab nfsd: Fix an FS_LAYOUT_TYPES/LAYOUT_TYPES encode bug
Currently we'll respond correctly to a request for either
FS_LAYOUT_TYPES or LAYOUT_TYPES, but not to a request for both
attributes simultaneously.

Signed-off-by: Kinglong Mee <kinglongmee@gmail.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Cc: stable@vger.kernel.org
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
2015-08-31 16:12:39 -04:00
Wang Nan
acf860ae7c bpf tools: New API to get name from a BPF object
Before this patch there was no way to connect a loaded bpf object
to its source file. This makes applying perf's '--filter' to a BPF
object harder, because perf loads all programs together, but the
'--filter' setting is per object.

The API of bpf_object__open_buffer() is changed to allow passing a name.
Fortunately, at this time there's only one user of it (perf test LLVM),
so we change it together.

Signed-off-by: Wang Nan <wangnan0@huawei.com>
Cc: Alexei Starovoitov <ast@plumgrid.com>
Cc: Brendan Gregg <brendan.d.gregg@gmail.com>
Cc: Daniel Borkmann <daniel@iogearbox.net>
Cc: David Ahern <dsahern@gmail.com>
Cc: He Kuang <hekuang@huawei.com>
Cc: Jiri Olsa <jolsa@kernel.org>
Cc: Kaixu Xia <xiakaixu@huawei.com>
Cc: Masami Hiramatsu <masami.hiramatsu.pt@hitachi.com>
Cc: Namhyung Kim <namhyung@kernel.org>
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Zefan Li <lizefan@huawei.com>
Cc: pi3orama@163.com
Link: http://lkml.kernel.org/r/1440742821-44548-2-git-send-email-wangnan0@huawei.com
Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
2015-08-31 16:53:15 -03:00
Mike Snitzer
dc9cee5db5 dm cache: small cleanups related to deferred prison cell cleanup
Eliminate __cell_release() since it only had one caller that always
released the cell holder.

Switch cell_error_with_code() to using free_prison_cell() for the sake
of consistency.

Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2015-08-31 15:50:28 -04:00
Nikolay Aleksandrov
6ea3c9d5b0 mpls: fix mpls_net_init memory leak
Fix a memory leak in the mpls netns init function in case of failure. If
register_net_sysctl fails then we need to free the ctl_table.
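
The error path described above can be sketched in user-space C. This is
only an illustration of the pattern, not the mpls code: every name below
(table_entry, fake_register_sysctl, net_init_sketch, live_tables) is
hypothetical, and malloc/free stand in for the kernel allocators.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical stand-in for one sysctl table entry. */
struct table_entry {
    const char *name;
    int value;
};

/* Counts live table copies so that a leak would be visible to a caller. */
static int live_tables;

/* Hypothetical stand-in for a registration call: returns non-zero on
 * success and zero on failure (the failure is forced by the caller). */
static int fake_register_sysctl(struct table_entry *table, int should_fail)
{
    (void)table;
    return !should_fail;
}

/* Duplicate a template table, register it, and free the copy again if
 * registration fails. Returns 0 on success and -1 on any failure. */
static int net_init_sketch(const struct table_entry *tmpl, size_t n,
                           int should_fail, struct table_entry **out)
{
    struct table_entry *table;

    assert(tmpl && out);

    table = malloc(n * sizeof(*table));
    if (!table)
        return -1;
    memcpy(table, tmpl, n * sizeof(*table));
    live_tables++;

    if (!fake_register_sysctl(table, should_fail)) {
        /* Without this free, the copy would leak on the failure path. */
        free(table);
        live_tables--;
        return -1;
    }

    *out = table;
    return 0;
}
```

Calling net_init_sketch() with should_fail set leaves live_tables at zero,
which is the property the fix restores.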

Fixes: 7720c01f3f ("mpls: Add a sysctl to control the size of the mpls label table")
Signed-off-by: Nikolay Aleksandrov <nikolay@cumulusnetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-08-31 12:45:09 -07:00
David Ahern
f0fa6e529e net: Add tos to validate source tracepoint
TOS is another key aspect of the lookup passed to fib_validate_source.
Add it to the tracepoint.

Signed-off-by: David Ahern <dsa@cumulusnetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-08-31 12:42:04 -07:00
Alexei Starovoitov
dbb7ee0e47 lib: move strncpy_from_unsafe() into mm/maccess.c
To fix build errors:
kernel/built-in.o: In function `bpf_trace_printk':
bpf_trace.c:(.text+0x11a254): undefined reference to `strncpy_from_unsafe'
kernel/built-in.o: In function `fetch_memory_string':
trace_kprobe.c:(.text+0x11acf8): undefined reference to `strncpy_from_unsafe'

move strncpy_from_unsafe() next to probe_kernel_read/write()
which use the same memory access style.

Reported-by: Fengguang Wu <fengguang.wu@intel.com>
Reported-by: Guenter Roeck <linux@roeck-us.net>
Fixes: 1a6877b9c0 ("lib: introduce strncpy_from_unsafe()")
Signed-off-by: Alexei Starovoitov <ast@plumgrid.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-08-31 12:36:10 -07:00
David S. Miller
9dc30648f0 Merge branch 'per-route-dctcp-receive-side'
Daniel Borkmann says:

====================
tcp: receive-side per route dctcp handling

Original cover letter:

  Currently, the following case doesn't use DCTCP, even if it should:

    - responder has e.g. cubic as system wide default
    - 'ip route congctl dctcp $src' was set

  Then, DCTCP is NOT used if a DCTCP sender attempts to connect from a
  host in the $src range: ECT(0) is set, but listen_sk is not dctcp, so
  we fail the INET_ECN_is_not_ect sanity check.

  We also have to examine the dst used for the SYN/ACK reply to make
  this case work.

  In order to minimize additional cost, store the 'ecn is must have'
  information in the dst_features field.

  The set targets -next instead of -net since this doesn't seem to be a
  serious bug and to give the change more soak time until it hits Linus'
  tree.

v1 -> v2:
 - Addressed Dave's feedback, not exposing any bits to user space
 - Added patch 3 to reject incorrect configurations
 - Rest as is, rebased and retested
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
2015-08-31 12:34:00 -07:00
Daniel Borkmann
c3a8d94746 tcp: use dctcp if enabled on the route to the initiator
Currently, the following case doesn't use DCTCP, even if it should:
A responder has e.g. Cubic as system wide default, but for a specific
route to the initiating host, DCTCP is being set in RTAX_CC_ALGO. The
initiating host then uses DCTCP as congestion control, but since the
initiator sets ECT(0), tcp_ecn_create_request() doesn't set ecn_ok,
and we have to fall back to Reno after 3WHS completes.

We were thinking about how to solve this in a minimal, non-intrusive
way without bloating tcp_ecn_create_request() needlessly: let's cache
the CA ecn option flag in RTAX_FEATURES. In other words, when ECT(0)
is set on the SYN packet, set ecn_ok=1 iff route RTAX_FEATURES
contains the unexposed (internal-only) DST_FEATURE_ECN_CA. This allows
doing only a single metric feature lookup inside tcp_ecn_create_request().
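
The decision described above might be sketched as follows. This is a
simplified user-space model under stated assumptions, not the code of
tcp_ecn_create_request(): the flag value SKETCH_FEATURE_ECN_CA, the
function name and all parameters are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical flag for this sketch: the congestion control algorithm
 * configured on the route needs ECN. Not the kernel's definition. */
#define SKETCH_FEATURE_ECN_CA (1u << 1)

/* Decide whether ECN is accepted for an incoming SYN that asked for it.
 *
 * syn_has_ect:           the IP header of the SYN carried an ECT codepoint
 * ecn_enabled:           ECN is enabled in general on the responder
 * listener_ca_needs_ecn: the listener's own congestion control needs ECN
 * route_features:        cached feature bits of the route to the initiator
 */
static bool ecn_ok_sketch(bool syn_has_ect, bool ecn_enabled,
                          bool listener_ca_needs_ecn,
                          uint32_t route_features)
{
    /* A classic ECN SYN carries no ECT codepoint. */
    if (!syn_has_ect && ecn_enabled)
        return true;

    /* ECT on the SYN is only acceptable when ECN is a must-have, either
     * for the listener or for the route back to the initiator. */
    return listener_ca_needs_ecn ||
           (route_features & SKETCH_FEATURE_ECN_CA) != 0;
}
```

With a Cubic listener and ECT(0) on the SYN, the result depends only on
the single cached route flag, which is the one lookup the message refers to.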

Joint work with Florian Westphal.

Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-08-31 12:34:00 -07:00
Daniel Borkmann
b8d3e4163a fib, fib6: reject invalid feature bits
Feature bits that are invalid should not be accepted by the kernel.
Only the lower 4 bits may be configured, not the remaining ones,
and even of these 4, 2 are unused.
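
A minimal sketch of such a check, assuming the configurable bits are
exactly the low four. The macro and function names are hypothetical and
are not taken from the patch.

```c
#include <assert.h>
#include <errno.h>
#include <stdint.h>

/* Assumption for this sketch: only the low four bits are configurable. */
#define FEATURE_MASK_SKETCH 0xFu

/* Return 0 if only configurable bits are set, -EINVAL otherwise. */
static int check_feature_bits(uint32_t val)
{
    if (val & ~FEATURE_MASK_SKETCH)
        return -EINVAL;
    return 0;
}
```

Rejecting the value outright, rather than silently masking it, lets a
misconfiguration surface as an error to whoever set the route.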

Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-08-31 12:34:00 -07:00
Daniel Borkmann
1bb14807bc net: fib6: reduce identation in ip6_convert_metrics
Reduce the indentation a bit; there's no need to have it
artificially increased.

Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-08-31 12:34:00 -07:00
Florian Westphal
6cf9dfd3bd net: fib: move metrics parsing to a helper
fib_create_info() is already quite large, so before adding more
code to the metrics section move that to a helper, similar to
ip6_convert_metrics.

Suggested-by: Daniel Borkmann <daniel@iogearbox.net>
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-08-31 12:34:00 -07:00
Philip Downey
87583ebb9f IGMP: Document igmp_link_local_mcast_reports
Document the addition of a new sysctl variable which controls the
generation of IGMP reports for link local multicast groups in the
224.0.0.X range.

IGMP reports for local multicast groups can now be optionally
inhibited by setting the value to zero e.g.:
echo 0 > /proc/sys/net/ipv4/igmp_link_local_mcast_reports

To retain backwards compatibility the previous behaviour is retained
by default on system boot or reverted by setting the value back to
non-zero.

Signed-off-by: Philip Downey <pdowney@brocade.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-08-31 12:30:37 -07:00
Pravin B Shelar
4c22279848 ip-tunnel: Use API to access tunnel metadata options.
Currently the tun-info options pointer is used in a few cases to
pass options around. But tunnel options can be accessed using the
ip_tunnel_info_opts() API without using the pointer. This
patch removes the redundant pointer and consistently makes use
of the API.

Signed-off-by: Pravin B Shelar <pshelar@nicira.com>
Acked-by: Thomas Graf <tgraf@suug.ch>
Reviewed-by: Jesse Gross <jesse@nicira.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-08-31 12:28:56 -07:00
Kinglong Mee
0a2050d744 NFSD: Store parent's stat in a separate value
After commit ae7095a7c4 (nfsd4: helper function for getting mounted_on
ino) we ignore the return value from get_parent_attributes().

Also, the following FATTR4_WORD2_LAYOUT_BLKSIZE uses stat.blksize, so to
avoid overwriting that, use an independent value for the parent's
attributes.

Signed-off-by: Kinglong Mee <kinglongmee@gmail.com>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
2015-08-31 15:11:05 -04:00
Joe Thornber
9153df7405 dm cache: fix leaking of deferred bio prison cells
There were two cases where dm_cell_visit_release() was being called,
which removes the cell from the prison's rbtree, but the callers didn't
also return the cell to the mempool.  Fix this by having them call
free_prison_cell().

This leak manifested as the 'kmalloc-96' slab growing until OOM.

Fixes: 651f5fa2a3 ("dm cache: defer whole cells")
Signed-off-by: Joe Thornber <ejt@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Cc: stable@vger.kernel.org # 4.1+
2015-08-31 15:08:14 -04:00
Heinz Mauelshagen
f15f4d7200 dm raid: document RAID 4/5/6 discard support
For RAID 4/5/6 data integrity reasons 'discard_zeroes_data' must work
properly.

Signed-off-by: Heinz Mauelshagen <heinzm@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2015-08-31 15:05:31 -04:00
Adrian Hunter
97db62062a perf tools: Fix build on powerpc broken by pt/bts
It is theoretically possible to process perf.data files created on x86
and that contain Intel PT or Intel BTS data, on any other architecture,
which is why it is possible for there to be build errors on powerpc
caused by pt/bts.

The errors were:

	util/intel-pt-decoder/intel-pt-insn-decoder.c: In function ‘intel_pt_insn_decoder’:
	util/intel-pt-decoder/intel-pt-insn-decoder.c:138:3: error: switch missing default case [-Werror=switch-default]
	   switch (insn->immediate.nbytes) {
	   ^
	cc1: all warnings being treated as errors

	linux-acme.git/tools/perf/perf-obj/libperf.a(libperf-in.o): In function `intel_pt_synth_branch_sample':
	sources/linux-acme.git/tools/perf/util/intel-pt.c:871: undefined reference to `tsc_to_perf_time'
	linux-acme.git/tools/perf/perf-obj/libperf.a(libperf-in.o): In function `intel_pt_sample':
	sources/linux-acme.git/tools/perf/util/intel-pt.c:915: undefined reference to `tsc_to_perf_time'
	sources/linux-acme.git/tools/perf/util/intel-pt.c:962: undefined reference to `tsc_to_perf_time'
	linux-acme.git/tools/perf/perf-obj/libperf.a(libperf-in.o): In function `intel_pt_process_event':
	sources/linux-acme.git/tools/perf/util/intel-pt.c:1454: undefined reference to `perf_time_to_tsc'

Signed-off-by: Adrian Hunter <adrian.hunter@intel.com>
Cc: Jiri Olsa <jolsa@redhat.com>
Cc: Sukadev Bhattiprolu <sukadev@linux.vnet.ibm.com>
Cc: Wang Nan <wangnan0@huawei.com>
Cc: Zefan Li <lizefan@huawei.com>
Cc: pi3orama@163.com
Link: http://lkml.kernel.org/r/1441046384-28663-1-git-send-email-adrian.hunter@intel.com
Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
2015-08-31 15:47:33 -03:00
Madalin Bucur
d1bfc62591 ipv4: fix 32b build
Address remaining issue after 80ec192.

Signed-off-by: Madalin Bucur <madalin.bucur@freescale.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-08-31 11:32:41 -07:00
NeilBrown
c3cce6cda1 md/raid5: ensure device failure recorded before write request returns.
When a write to one of the devices of a RAID5/6 fails, the failure is
recorded in the metadata of the other devices so that after a restart
the data on the failed drive won't be trusted even if that drive seems
to be working again (maybe a cable was unplugged).

Similarly when we record a bad-block in response to a write failure,
we must not let the write complete until the bad-block update is safe.

Currently there is no interlock between the write request completing
and the metadata update.  So it is possible that the write will
complete, the app will confirm success in some way, and then the
machine will crash before the metadata update completes.

This is an extremely small hole for a race to fit in, but it is
theoretically possible and so should be closed.

So:
 - set MD_CHANGE_PENDING when requesting a metadata update for a
   failed device, so we can know with certainty when it completes
 - queue requests that completed when MD_CHANGE_PENDING is set to
   only be processed after the metadata update completes
 - call raid_end_bio_io() on bios in that queue when the time comes.
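
The three steps above can be modelled in user-space C. This is a sketch
of the ordering only, not the md implementation: struct request,
struct array_state and the function names are hypothetical, and the
change_pending field merely models MD_CHANGE_PENDING.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for a write request. */
struct request {
    struct request *next;
    bool completed;      /* true once success was reported upwards */
};

struct array_state {
    bool change_pending;     /* models MD_CHANGE_PENDING */
    struct request *head;    /* deferred requests, oldest first */
    struct request **tail;   /* where the next deferred request is linked */
};

static void state_init(struct array_state *s)
{
    assert(s);
    s->change_pending = false;
    s->head = NULL;
    s->tail = &s->head;
}

/* Called when the I/O for a write has finished. */
static void write_finished(struct array_state *s, struct request *r)
{
    assert(s && r);
    if (s->change_pending) {
        /* Hold the completion back until the metadata is safe. */
        r->next = NULL;
        *s->tail = r;
        s->tail = &r->next;
        return;
    }
    r->completed = true;
}

/* Called once the metadata update has reached stable storage. */
static void metadata_update_done(struct array_state *s)
{
    struct request *r;

    assert(s);
    r = s->head;
    s->change_pending = false;
    s->head = NULL;
    s->tail = &s->head;

    /* Only now report the deferred writes as complete. */
    while (r) {
        struct request *next = r->next;

        r->completed = true;
        r = next;
    }
}
```

A request that finishes while change_pending is set stays incomplete
until metadata_update_done() runs, so a crash in between cannot leave an
acknowledged write without the recorded failure.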


Signed-off-by: NeilBrown <neilb@suse.com>
2015-08-31 19:43:59 +02:00
NeilBrown
34a6f80e16 md/raid5: use bio_list for the list of bios to return.
This will make it easier to splice two lists together which will
be needed in a future patch.

Signed-off-by: NeilBrown <neilb@suse.com>
2015-08-31 19:43:50 +02:00
NeilBrown
95af587e95 md/raid10: ensure device failure recorded before write request returns.
When a write to one of the legs of a RAID10 fails, the failure is
recorded in the metadata of the other legs so that after a restart
the data on the failed drive won't be trusted even if that drive seems
to be working again (maybe a cable was unplugged).

Currently there is no interlock between the write request completing
and the metadata update.  So it is possible that the write will
complete, the app will confirm success in some way, and then the
machine will crash before the metadata update completes.

This is an extremely small hole for a race to fit in, but it is
theoretically possible and so should be closed.

So:
 - set MD_CHANGE_PENDING when requesting a metadata update for a
   failed device, so we can know with certainty when it completes
 - queue requests that experienced an error on a new queue which
   is only processed after the metadata update completes
 - call raid_end_bio_io() on bios in that queue when the time comes.

Signed-off-by: NeilBrown <neilb@suse.com>
2015-08-31 19:43:45 +02:00
NeilBrown
55ce74d4bf md/raid1: ensure device failure recorded before write request returns.
When a write to one of the legs of a RAID1 fails, the failure is
recorded in the metadata of the other leg(s) so that after a restart
the data on the failed drive won't be trusted even if that drive seems
to be working again (maybe a cable was unplugged).

Similarly when we record a bad-block in response to a write failure,
we must not let the write complete until the bad-block update is safe.

Currently there is no interlock between the write request completing
and the metadata update.  So it is possible that the write will
complete, the app will confirm success in some way, and then the
machine will crash before the metadata update completes.

This is an extremely small hole for a race to fit in, but it is
theoretically possible and so should be closed.

So:
 - set MD_CHANGE_PENDING when requesting a metadata update for a
   failed device, so we can know with certainty when it completes
 - queue requests that experienced an error on a new queue which
   is only processed after the metadata update completes
 - call raid_end_bio_io() on bios in that queue when the time comes.

Signed-off-by: NeilBrown <neilb@suse.com>
2015-08-31 19:43:23 +02:00
NeilBrown
18b9f67962 md-cluster: remove inappropriate try_module_get from join()
md_setup_cluster already calls try_module_get(), so this
try_module_get isn't needed.
Also, there is no matching module_put (except in the error path),
so this leaves an unbalanced module count.

Signed-off-by: NeilBrown <neilb@suse.com>
2015-08-31 19:43:17 +02:00
NeilBrown
6022e75bf0 md: extend spinlock protection in register_md_cluster_operations
This code looks racy.

The only possible race is if two modules try to register at the same
time and that won't happen.  But make the code look safe anyway.

Signed-off-by: NeilBrown <neilb@suse.com>
2015-08-31 19:42:59 +02:00
Guoqing Jiang
abb9b22ac9 md-cluster: Read the disk bitmap sb and check if it needs recovery
In gather_all_resync_info, we need to read the disk bitmap sb and
check if it needs recovery.

Reviewed-by: Goldwyn Rodrigues <rgoldwyn@suse.com>
Signed-off-by: Guoqing Jiang <gqjiang@suse.com>
Signed-off-by: NeilBrown <neilb@suse.com>
2015-08-31 19:42:41 +02:00
Guoqing Jiang
eece075cda md-cluster: only call complete(&cinfo->completion) when node join cluster
Introduce the MD_CLUSTER_BEGIN_JOIN_CLUSTER flag to make sure
complete(&cinfo->completion) is only invoked when a node
joins the cluster. Otherwise a node failure could also call
complete(), which doesn't make sense.

Reviewed-by: Goldwyn Rodrigues <rgoldwyn@suse.com>
Signed-off-by: Guoqing Jiang <gqjiang@suse.com>
Signed-off-by: NeilBrown <neilb@suse.com>
2015-08-31 19:42:31 +02:00
Guoqing Jiang
6e6d9f2cda md-cluster: add missed lockres_free
We also need to free the lock resource before goto out.

Reviewed-by: Goldwyn Rodrigues <rgoldwyn@suse.com>
Signed-off-by: Guoqing Jiang <gqjiang@suse.com>
Signed-off-by: NeilBrown <neilb@suse.com>
2015-08-31 19:42:23 +02:00
Guoqing Jiang
b2b9bfff0a md-cluster: remove the unused sb_lock
The sb_lock is not used anywhere, so let's remove it.

Reviewed-by: Goldwyn Rodrigues <rgoldwyn@suse.com>
Signed-off-by: Guoqing Jiang <gqjiang@suse.com>
Signed-off-by: NeilBrown <neilb@suse.com>
2015-08-31 19:42:14 +02:00
Guoqing Jiang
9e3072e373 md-cluster: init suspend_list and suspend_lock early in join
If a node has just joined the cluster and receives a msg from other nodes
before suspend_list is initialized, the kernel crashes with a NULL pointer
dereference, so move the initializations earlier to fix the bug.

md-cluster: Joined cluster 3578507b-e0cb-6d4f-6322-696cd7b1b10c slot 3
BUG: unable to handle kernel NULL pointer dereference at           (null)
... ... ...
Call Trace:
[<ffffffffa0444924>] process_recvd_msg+0x2e4/0x330 [md_cluster]
[<ffffffffa0444a06>] recv_daemon+0x96/0x170 [md_cluster]
[<ffffffffa045189d>] md_thread+0x11d/0x170 [md_mod]
[<ffffffff810768c4>] kthread+0xb4/0xc0
[<ffffffff8151927c>] ret_from_fork+0x7c/0xb0
... ... ...
RIP  [<ffffffffa0443581>] __remove_suspend_info+0x11/0xa0 [md_cluster]

Reviewed-by: Goldwyn Rodrigues <rgoldwyn@suse.com>
Signed-off-by: Guoqing Jiang <gqjiang@suse.com>
Signed-off-by: NeilBrown <neilb@suse.com>
2015-08-31 19:42:05 +02:00
Guoqing Jiang
b5ef56789b md-cluster: add the error check if failed to get dlm lock
In a complicated cluster environment, it is possible that the
dlm lock can't be acquired or converted, so the related error
info is added to make debugging potential issues easier.

For lockres_free, if the lock is blocked by a lock request or
conversion request, then dlm_unlock just puts it back on the grant
queue, so we need to ensure the lock is finally freed.

Signed-off-by: Guoqing Jiang <gqjiang@suse.com>
Signed-off-by: NeilBrown <neilb@suse.com>
2015-08-31 19:41:56 +02:00
Guoqing Jiang
b83d51c078 md-cluster: init completion within lockres_init
We should init the completion within lockres_init, otherwise
the completion could be initialized more than once during
its life cycle.

Reviewed-by: Goldwyn Rodrigues <rgoldwyn@suse.com>
Signed-off-by: Guoqing Jiang <gqjiang@suse.com>
Signed-off-by: NeilBrown <neilb@suse.com>
2015-08-31 19:41:50 +02:00
Guoqing Jiang
66099bb0ee md-cluster: fix deadlock issue on message lock
There is a problem with the previous communication mechanism: we hit the
deadlock scenario below on a cluster with 3 nodes.

    Sender                      Receiver                  Receiver

    token(EX)
    message(EX)
    writes message
    downconverts message(CR)
    requests ack(EX)
                                gets message(CR)          gets message(CR)
                                reads message             reads message
                                requests EX on message    requests EX on message

To fix this problem, we make the following changes:

1. the sender downconverts MESSAGE to CW rather than CR.
2. the receiver requests a PR lock, not an EX lock, on message.

And in case we fail to down-convert EX to CW on message, it is better to
unlock message rather than keep holding the lock.
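
Why two receivers asking for EX block each other, while asking for PR
does not, can be illustrated with the classic DLM mode compatibility
matrix. The sketch below only models mode compatibility; it is not the
md-cluster code, and the helper can_grant() is a hypothetical name.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Lock modes of the classic DLM model, weakest to strongest. */
enum mode { NL, CR, CW, PR, PW, EX, MODE_COUNT };

/* compat[held][requested] is true when both modes may coexist. */
static const bool compat[MODE_COUNT][MODE_COUNT] = {
    /*          NL     CR     CW     PR     PW     EX   */
    /* NL */ { true,  true,  true,  true,  true,  true  },
    /* CR */ { true,  true,  true,  true,  true,  false },
    /* CW */ { true,  true,  true,  false, false, false },
    /* PR */ { true,  true,  false, true,  false, false },
    /* PW */ { true,  true,  false, false, false, false },
    /* EX */ { true,  false, false, false, false, false },
};

/* A conversion to `want` can be granted only if it is compatible with
 * the modes currently held by every other node. */
static bool can_grant(enum mode want, const enum mode *others, size_t n)
{
    assert(want < MODE_COUNT);
    for (size_t i = 0; i < n; i++) {
        if (!compat[others[i]][want])
            return false;
    }
    return true;
}
```

In the old scheme each receiver holds CR and wants EX, so each request is
blocked by the other receiver's CR. In the new scheme a PR request only
waits for the sender's CW and is compatible with the other receiver's CR
or PR.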

Reviewed-by: Goldwyn Rodrigues <rgoldwyn@suse.com>
Signed-off-by: Lidong Zhong <ldzhong@suse.com>
Signed-off-by: Guoqing Jiang <gqjiang@suse.com>
Signed-off-by: NeilBrown <neilb@suse.com>
2015-08-31 19:41:41 +02:00
Guoqing Jiang
dc737d7c3d md-cluster: transfer the resync ownership to another node
When node A stops an array while the array is doing a resync, we need
to let another node B take over the resync task.

To achieve this, node A needs to send an explicit BITMAP_NEEDS_SYNC
message to the cluster. Node B, on receiving that message, will
invoke __recover_slot to do the resync.

Reviewed-by: Goldwyn Rodrigues <rgoldwyn@suse.com>
Signed-off-by: Guoqing Jiang <gqjiang@suse.com>
Signed-off-by: NeilBrown <neilb@suse.com>
2015-08-31 19:41:12 +02:00
Guoqing Jiang
05cd0e5176 md-cluster: split recover_slot for future code reuse
Make recover_slot a wrapper around __recover_slot, since the
logic of __recover_slot can be reused for the case
where other nodes need to take over the resync job.

Signed-off-by: Guoqing Jiang <gqjiang@suse.com>
Signed-off-by: NeilBrown <neilb@suse.com>
2015-08-31 19:40:41 +02:00
Guoqing Jiang
b89f704a8d md-cluster: use %pU to print UUIDs
Reviewed-by: Goldwyn Rodrigues <rgoldwyn@suse.com>
Signed-off-by: Guoqing Jiang <gqjiang@suse.com>
Signed-off-by: NeilBrown <neilb@suse.com>
2015-08-31 19:40:30 +02:00
Sasha Levin
25b2edfa3b md: setup safemode_timer before it's being used
We used to set up the safemode_timer timer in md_run. If md_run
failed before the timer was set up, we'd end up trying to modify
a timer that doesn't have a callback function when we access safe_delay_store,
which would trigger a BUG.

neilb: delete init_timer() call as setup_timer() does that.

Signed-off-by: Sasha Levin <sasha.levin@oracle.com>
Signed-off-by: NeilBrown <neilb@suse.com>
2015-08-31 19:39:39 +02:00
NeilBrown
6cbd81487f md/raid5: handle possible race as reshape completes.
It is possible (though unlikely) for a reshape to be
interrupted between the time that end_reshape is called
and the time when raid5_finish_reshape is called.

This can leave conf->reshape_progress set to MaxSector,
but mddev->reshape_position not.

This combination confused reshape_request() when ->reshape_backwards.
As conf->reshape_progress is so high, it seems the reshape hasn't
really begun.  But assuming MaxSector is a valid address only
leads to sorrow.

So ensure reshape_position and reshape_progress both agree,
and add an extra check in reshape_request() just in case they don't.

Signed-off-by: NeilBrown <neilb@suse.com>
2015-08-31 19:38:59 +02:00
NeilBrown
5ed1df2eac md: sync sync_completed has correct value as recovery finishes.
There can be a small window between the moment that recovery
actually writes the last block and the time when various sysfs
and /proc/mdstat attributes report that it has finished.
During this time, 'sync_completed' can have the wrong value.
This can confuse monitoring software.

So:
 - don't set curr_resync_completed beyond the end of the devices,
 - set it correctly when resync/recovery has completed.

Signed-off-by: NeilBrown <neilb@suse.com>
2015-08-31 19:38:17 +02:00
NeilBrown
c5e19d906a md: be careful when testing resync_max against curr_resync_completed.
While it generally shouldn't happen, it is not impossible for
curr_resync_completed to exceed resync_max.
This can particularly happen when reshaping RAID5 - the current
status isn't copied to curr_resync_completed promptly, so when it
is, it can exceed resync_max.
This happens when the reshape is 'frozen', resync_max is set low,
and reshape is re-enabled.

Taking a difference between two unsigned numbers is always dangerous
anyway, so add a test to behave correctly if
   curr_resync_completed > resync_max
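
The hazard can be shown with a small user-space sketch. The function
names are hypothetical and this is not the check added by the patch; it
only demonstrates why the guard is needed.

```c
#include <assert.h>
#include <stdint.h>

/* Sectors still allowed before resync_max is reached; 0 when the
 * completed position has already passed the limit. */
static uint64_t sectors_until_max(uint64_t resync_max, uint64_t completed)
{
    if (completed > resync_max)
        return 0; /* plain subtraction would wrap to a huge value here */
    return resync_max - completed;
}

/* The unguarded form, shown only to demonstrate the wrap-around. */
static uint64_t naive_difference(uint64_t resync_max, uint64_t completed)
{
    return resync_max - completed;
}
```

With resync_max = 40 and a completed position of 100, the naive form
yields a value close to UINT64_MAX instead of a negative number, so any
"is there room left" comparison built on it would wrongly succeed.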

Signed-off-by: NeilBrown <neilb@suse.com>
2015-08-31 19:37:33 +02:00
NeilBrown
a4a3d26d87 md: set MD_RECOVERY_RECOVER when starting a degraded array.
This ensures that 'sync_action' will show 'recover' immediately the
array is started.  If there is no spare the status will change to
'idle' once that is detected.

Clear MD_RECOVERY_RECOVER for a read-only array to ensure this change
happens.

This allows scripts which monitor status not to get confused -
particularly my test scripts.

Signed-off-by: NeilBrown <neilb@suse.com>
2015-08-31 19:37:03 +02:00
NeilBrown
c74c0d760e md/raid5: remove incorrect "min_t()" when calculating writepos.
This code is calculating:
  writepos, which is the furthest along address (device-space) that we
     *will* be writing to
  readpos, which is the earliest address that we *could* possibly read
     from, and
  safepos, which is the earliest address in the 'old' section that we
     might read from after a crash when the reshape position is
     recovered from metadata.

  The first is a precise calculation, so clipping at zero doesn't
  make sense.  As the reshape position is now guaranteed to always be
  a multiple of reshape_sectors and as we already BUG_ON when
  reshape_progress is zero, there is no point in this min_t() call.

  The readpos and safepos are worst case - actual value depends on
  precise geometry.  That worst case could be negative, which is only
  a problem because we are storing the value in an unsigned.
  So leave the min_t() for those.

Signed-off-by: NeilBrown <neilb@suse.com>
2015-08-31 19:36:06 +02:00
NeilBrown
05256d9884 md/raid5: strengthen check on reshape_position at run.
When reshaping, we work in units of the largest chunk size.
If changing from a larger to a smaller chunk size, that means we
reshape more than one stripe at a time.  So the required alignment
of reshape_position needs to take into account both the old
and new chunk size.
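
One way to picture the stricter alignment rule is the sketch below. It
assumes the alignment unit is the larger of the two chunk sizes times the
number of data disks; that formula, the function name and the parameters
are assumptions for illustration, not the check in the patch.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Return true when `pos` (in sectors) lies on a boundary of the unit in
 * which a reshape is assumed to proceed: the larger chunk size times the
 * number of data disks. */
static bool reshape_position_aligned(uint64_t pos, uint32_t old_chunk,
                                     uint32_t new_chunk, uint32_t data_disks)
{
    uint64_t chunk = old_chunk > new_chunk ? old_chunk : new_chunk;
    uint64_t unit = chunk * data_disks;

    assert(unit != 0);
    return pos % unit == 0;
}
```

For example, with an old chunk of 1024 sectors, a new chunk of 512 and 3
data disks, position 1536 is aligned to a stripe of the new layout but
not to the larger unit, so this sketch reports it as misaligned.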

This means that both 'here_new' and 'here_old' are calculated with
respect to the same (maximum) chunk size, so testing if they are the
same when delta_disks is zero becomes pointless.

Signed-off-by: NeilBrown <neilb@suse.com>
2015-08-31 19:34:21 +02:00
NeilBrown
3cb5edf454 md/raid5: switch to use conf->chunk_sectors in place of mddev->chunk_sectors where possible
The chunk_sectors and new_chunk_sectors fields of mddev can be changed
any time (via sysfs) that the reconfig mutex can be taken.  So raid5
keeps internal copies in 'conf' which are stable except for a short
locked moment when reshape stops/starts.

So any access that does not hold reconfig_mutex should use the 'conf'
values, not the 'mddev' values.
Several don't.

This could result in corruption if new values were written at awkward
times.

Also use min() or max() rather than open-coding.

Signed-off-by: NeilBrown <neilb@suse.com>
2015-08-31 19:32:48 +02:00
NeilBrown
5cac6bcb93 md/raid5: always set conf->prev_chunk_sectors and ->prev_algo
These aren't really needed when no reshape is happening,
but it is safer to have them always set to a meaningful value.
The next patch will use ->prev_chunk_sectors without checking
if a reshape is happening (because that makes the code simpler),
and this patch makes that safe.

Signed-off-by: NeilBrown <neilb@suse.com>
2015-08-31 19:32:25 +02:00
NeilBrown
02ec50265b md/raid10: fix a few typos in comments
Signed-off-by: NeilBrown <neilb@suse.com>
2015-08-31 19:32:09 +02:00