Commit graph

518990 commits

Author SHA1 Message Date
Herbert Xu
654ae152b3 crypto: algif_rng - Remove obsolete const-removal cast
Now that crypto_rng_reset takes a const argument, we no longer
need to cast away the const qualifier.

Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
2015-04-22 09:30:21 +08:00
Herbert Xu
94f1bb15be crypto: rng - Remove old low-level rng interface
Now that all rng implementations have switched over to the new
interface, we can remove the old low-level interface.

Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
2015-04-22 09:30:20 +08:00
Herbert Xu
e33cf2c5aa crypto: krng - Convert to new rng interface
This patch ocnverts the KRNG implementation to the new low-level
rng interface.

Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
2015-04-22 09:30:18 +08:00
Herbert Xu
e7c2422a83 crypto: ansi_cprng - Convert to new rng interface
This patch ocnverts the ANSI CPRNG implementation to the new
low-level rng interface.

Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
Acked-by: Neil Horman <nhorman@tuxdriver.com>
2015-04-22 09:30:18 +08:00
Herbert Xu
6f7e3caa9e crypto: ansi_cprng - Remove bogus inclusion of internal.h
The file internal.h is only meant to be used by internel API
implementation and not algorithm implementations.  In fact it
isn't even needed here so this patch removes it.

Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
Acked-by: Neil Horman <nhorman@tuxdriver.com>
2015-04-22 09:30:17 +08:00
Herbert Xu
8fded5925d crypto: drbg - Convert to new rng interface
This patch converts the DRBG implementation to the new low-level
rng interface.

This allows us to get rid of struct drbg_gen by using the new RNG
API instead.

Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
Acked-by: Stephan Mueller <smueller@chronox.de>
2015-04-22 09:30:17 +08:00
Herbert Xu
881cd6c570 crypto: rng - Add multiple algorithm registration interface
This patch adds the helpers that allow the registration and removal
of multiple RNG algorithms.

Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
2015-04-22 09:30:16 +08:00
Herbert Xu
7ca99d8148 crypto: rng - Add crypto_rng_set_entropy
This patch adds the function crypto_rng_set_entropy.  It is only
meant to be used by testmgr when testing RNG implementations by
providing fixed entropy data in order to verify test vectors.

Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
2015-04-22 09:30:15 +08:00
Herbert Xu
acec27ff35 crypto: rng - Convert low-level crypto_rng to new style
This patch converts the low-level crypto_rng interface to the
"new" style.

This allows existing implementations to be converted over one-
by-one.  Once that is complete we can then remove the old rng
interface.

Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
2015-04-22 09:30:14 +08:00
Herbert Xu
3c5d8fa9f5 crypto: rng - Mark crypto_rng_reset seed as const
There is no reason why crypto_rng_reset should modify the seed
so this patch marks it as const.  Since our algorithms don't
export a const seed function yet we have to go through some
contortions for now.

Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
2015-04-22 09:30:07 +08:00
Linus Torvalds
f614c8178b Merge branch 'parisc-4.1-1' of git://git.kernel.org/pub/scm/linux/kernel/git/deller/parisc-linux
Pull parisc fixes from Helge Deller:
 "The patch by Guenter Roeck fixes the build on parisc which got broken
  because of commit f24ffde432 ("parisc: expose number of page table
  levels on Kconfig level") and the patch from Matthew Wilcox converts
  our code to use the generic scatterlist.h header file"

* 'parisc-4.1-1' of git://git.kernel.org/pub/scm/linux/kernel/git/deller/parisc-linux:
  parisc: Replace PT_NLEVELS with CONFIG_PGTABLE_LEVELS
  parisc: Eliminate sg_virt_addr() and private scatterlist.h
2015-04-21 17:57:28 -07:00
Eric Mei
9ffc8f7cb9 md/raid5: don't do chunk aligned read on degraded array.
When array is degraded, read data landed on failed drives will result in
reading rest of data in a stripe. So a single sequential read would
result in same data being read twice.

This patch is to avoid chunk aligned read for degraded array. The
downside is to involve stripe cache which means associated CPU overhead
and extra memory copy.

Test Results:
Following test are done on a enterprise storage node with Seagate 6T SAS
drives and Xeon E5-2648L CPU (10 cores, 1.9Ghz), 10 disks MD RAID6 8+2,
chunk size 128 KiB.

I use FIO, using direct-io with various bs size, enough queue depth,
tested sequential and 100% random read against 3 array config:
 1) optimal, as baseline;
 2) degraded;
 3) degraded with this patch.
Kernel version is 4.0-rc3.

Each individual test I only did once so there might be some variations,
but we just focus on big trend.

Sequential Read:
  bs=(KiB)  optimal(MiB/s)  degraded(MiB/s)  degraded-with-patch (MiB/s)
   1024       1608            656              995
    512       1624            710              956
    256       1635            728              980
    128       1636            771              983
     64       1612           1119             1000
     32       1580           1420             1004
     16       1368            688              986
      8        768            647              953
      4        411            413              850

Random Read:
  bs=(KiB)  optimal(IOPS)  degraded(IOPS)  degraded-with-patch (IOPS)
   1024        163            160              156
    512        274            273              272
    256        426            428              424
    128        576            592              591
     64        726            724              726
     32        849            848              837
     16        900            970              971
      8        927            940              929
      4        948            940              955

Some notes:
  * In sequential + optimal, as bs size getting smaller, the FIO thread
become CPU bound.
  * In sequential + degraded, there's big increase when bs is 64K and
32K, I don't have explanation.
  * In sequential + degraded-with-patch, the MD thread mostly become CPU
bound.

If you want to we can discuss specific data point in those data. But in
general it seems with this patch, we have more predictable and in most
cases significant better sequential read performance when array is
degraded, and almost no noticeable impact on random read.

Performance is a complicated thing, the patch works well for this
particular configuration, but may not be universal. For example I
imagine testing on all SSD array may have very different result. But I
personally think in most cases IO bandwidth is more scarce resource than
CPU.


Signed-off-by: Eric Mei <eric.mei@seagate.com>
Signed-off-by: NeilBrown <neilb@suse.de>
2015-04-22 08:00:43 +10:00
NeilBrown
edbe83ab4c md/raid5: allow the stripe_cache to grow and shrink.
The default setting of 256 stripe_heads is probably
much too small for many configurations.  So it is best to make it
auto-configure.

Shrinking the cache under memory pressure is easy.  The only
interesting part here is that we put a fairly high cost
('seeks') on shrinking the cache as the cost is greater than
just having to read more data, it reduces parallelism.

Growing the cache on demand needs to be done carefully.  If we allow
fast growth, that can upset memory balance as lots of dirty memory can
quickly turn into lots of memory queued in the stripe_cache.
It is important for the raid5 block device to appear congested to
allow write-throttling to work.

So we only add stripes slowly. We set a flag when an allocation
fails because all stripes are in use, allocate at a convenient
time when that flag is set, and don't allow it to be set again
until at least one stripe_head has been released for re-use.

This means that a spurt of requests will only cause one stripe_head
to be allocated, but a steady stream of requests will slowly
increase the cache size - until memory pressure puts it back again.

It could take hours to reach a steady state.

The value written to, and displayed in, stripe_cache_size is
used as a minimum.  The cache can grow above this and shrink back
down to it.  The actual size is not directly visible, though it can
be deduced to some extent by watching stripe_cache_active.

Signed-off-by: NeilBrown <neilb@suse.de>
2015-04-22 08:00:43 +10:00
NeilBrown
5423399a84 md/raid5: change ->inactive_blocked to a bit-flag.
This allows us to easily add more (atomic) flags.

Signed-off-by: NeilBrown <neilb@suse.de>
2015-04-22 08:00:43 +10:00
NeilBrown
486f0644c3 md/raid5: move max_nr_stripes management into grow_one_stripe and drop_one_stripe
Rather than adjusting max_nr_stripes whenever {grow,drop}_one_stripe()
succeeds, do it inside the functions.

Also choose the correct hash to handle next inside the functions.

This removes duplication and will help with future new uses of
{grow,drop}_one_stripe.

This also fixes a minor bug where the "md/raid:%md: allocate XXkB"
message always said "0kB".

Signed-off-by: NeilBrown <neilb@suse.de>
2015-04-22 08:00:42 +10:00
NeilBrown
a9683a795b md/raid5: pass gfp_t arg to grow_one_stripe()
This is needed for future improvement to stripe cache management.

Signed-off-by: NeilBrown <neilb@suse.de>
2015-04-22 08:00:42 +10:00
Markus Stockhausen
d06f191f8e md/raid5: introduce configuration option rmw_level
Depending on the available coding we allow optimized rmw logic for write
operations. To support easier testing this patch allows manual control
of the rmw/rcw descision through the interface /sys/block/mdX/md/rmw_level.

The configuration can handle three levels of control.

rmw_level=0: Disable rmw for all RAID types. Hardware assisted P/Q
calculation has no implementation path yet to factor in/out chunks of
a syndrome. Enforcing this level can be benefical for slow CPUs with
hardware syndrome support and fast SSDs.

rmw_level=1: Estimate rmw IOs and rcw IOs. Execute rmw only if we will
save IOs. This equals the "old" unpatched behaviour and will be the
default.

rmw_level=2: Execute rmw even if calculated IOs for rmw and rcw are
equal. We might have higher CPU consumption because of calculating the
parity twice but it can be benefical otherwise. E.g. RAID4 with fast
dedicated parity disk/SSD. The option is implemented just to be
forward-looking and will ONLY work with this patch!

Signed-off-by: Markus Stockhausen <stockhausen@collogia.de>
Signed-off-by: NeilBrown <neilb@suse.de>
2015-04-22 08:00:42 +10:00
Markus Stockhausen
584acdd49c md/raid5: activate raid6 rmw feature
Glue it altogehter. The raid6 rmw path should work the same as the
already existing raid5 logic. So emulate the prexor handling/flags
and split functions as needed.

1) Enable xor_syndrome() in the async layer.

2) Split ops_run_prexor() into RAID4/5 and RAID6 logic. Xor the syndrome
at the start of a rmw run as we did it before for the single parity.

3) Take care of rmw run in ops_run_reconstruct6(). Again process only
the changed pages to get syndrome back into sync.

4) Enhance set_syndrome_sources() to fill NULL pages if we are in a rmw
run. The lower layers will calculate start & end pages from that and
call the xor_syndrome() correspondingly.

5) Adapt the several places where we ignored Q handling up to now.

Performance numbers for a single E5630 system with a mix of 10 7200k
desktop/server disks. 300 seconds random write with 8 threads onto a
3,2TB (10*400GB) RAID6 64K chunk without spare (group_thread_cnt=4)

bsize   rmw_level=1   rmw_level=0   rmw_level=1   rmw_level=0
        skip_copy=1   skip_copy=1   skip_copy=0   skip_copy=0
   4K      115 KB/s      141 KB/s      165 KB/s      140 KB/s
   8K      225 KB/s      275 KB/s      324 KB/s      274 KB/s
  16K      434 KB/s      536 KB/s      640 KB/s      534 KB/s
  32K      751 KB/s    1,051 KB/s    1,234 KB/s    1,045 KB/s
  64K    1,339 KB/s    1,958 KB/s    2,282 KB/s    1,962 KB/s
 128K    2,673 KB/s    3,862 KB/s    4,113 KB/s    3,898 KB/s
 256K    7,685 KB/s    7,539 KB/s    7,557 KB/s    7,638 KB/s
 512K   19,556 KB/s   19,558 KB/s   19,652 KB/s   19,688 Kb/s

Signed-off-by: Markus Stockhausen <stockhausen@collogia.de>
Signed-off-by: NeilBrown <neilb@suse.de>
2015-04-22 08:00:42 +10:00
Markus Stockhausen
a582564b24 md/raid6 algorithms: xor_syndrome() for SSE2
The second and (last) optimized XOR syndrome calculation. This version
supports right and left side optimization. All CPUs with architecture
older than Haswell will benefit from it.

It should be noted that SSE2 movntdq kills performance for memory areas
that are read and written simultaneously in chunks smaller than cache
line size. So use movdqa instead for P/Q writes in sse21 and sse22 XOR
functions.

Signed-off-by: Markus Stockhausen <stockhausen@collogia.de>
Signed-off-by: NeilBrown <neilb@suse.de>
2015-04-22 08:00:42 +10:00
Markus Stockhausen
9a5ce91d05 md/raid6 algorithms: xor_syndrome() for generic int
Start the algorithms with the very basic one. It is left and right
optimized. That means we can avoid all calculations for unneeded pages
above the right stop offset. For pages below the left start offset we
still need the syndrome multiplication but without reading data pages.

Signed-off-by: Markus Stockhausen <stockhausen@collogia.de>
Signed-off-by: NeilBrown <neilb@suse.de>
2015-04-22 08:00:42 +10:00
Markus Stockhausen
7e92e1d762 md/raid6 algorithms: improve test program
It is always helpful to have a test tool in place if we implement
new data critical algorithms. So add some test routines to the raid6
checker that can prove if the new xor_syndrome() works as expected.

Run through all permutations of start/stop pages per algorithm and
simulate a xor_syndrome() assisted rmw run. After each rmw check if
the recovery algorithm still confirms that the stripe is fine.

Signed-off-by: Markus Stockhausen <stockhausen@collogia.de>
Signed-off-by: NeilBrown <neilb@suse.de>
2015-04-22 08:00:42 +10:00
Markus Stockhausen
fe5cbc6e06 md/raid6 algorithms: delta syndrome functions
v3: s-o-b comment, explanation of performance and descision for
the start/stop implementation

Implementing rmw functionality for RAID6 requires optimized syndrome
calculation. Up to now we can only generate a complete syndrome. The
target P/Q pages are always overwritten. With this patch we provide
a framework for inplace P/Q modification. In the first place simply
fill those functions with NULL values.

xor_syndrome() has two additional parameters: start & stop. These
will indicate the first and last page that are changing during a
rmw run. That makes it possible to avoid several unneccessary loops
and speed up calculation. The caller needs to implement the following
logic to make the functions work.

1) xor_syndrome(disks, start, stop, ...): "Remove" all data of source
blocks inside P/Q between (and including) start and end.

2) modify any block with start <= block <= stop

3) xor_syndrome(disks, start, stop, ...): "Reinsert" all data of
source blocks into P/Q between (and including) start and end.

Pages between start and stop that won't be changed should be filled
with a pointer to the kernel zero page. The reasons for not taking NULL
pages are:

1) Algorithms cross the whole source data line by line. Thus avoid
additional branches.

2) Having a NULL page avoids calculating the XOR P parity but still
need calulation steps for the Q parity. Depending on the algorithm
unrolling that might be only a difference of 2 instructions per loop.

The benchmark numbers of the gen_syndrome() functions are displayed in
the kernel log. Do the same for the xor_syndrome() functions. This
will help to analyze performance problems and give an rough estimate
how well the algorithm works. The choice of the fastest algorithm will
still depend on the gen_syndrome() performance.

With the start/stop page implementation the speed can vary a lot in real
life. E.g. a change of page 0 & page 15 on a stripe will be harder to
compute than the case where page 0 & page 1 are XOR candidates. To be not
to enthusiatic about the expected speeds we will run a worse case test
that simulates a change on the upper half of the stripe. So we do:

1) calculation of P/Q for the upper pages

2) continuation of Q for the lower (empty) pages

Signed-off-by: Markus Stockhausen <stockhausen@collogia.de>
Signed-off-by: NeilBrown <neilb@suse.de>
2015-04-22 08:00:41 +10:00
shli@kernel.org
dabc4ec6ba raid5: handle expansion/resync case with stripe batching
expansion/resync can grab a stripe when the stripe is in batch list. Since all
stripes in batch list must be in the same state, we can't allow some stripes
run into expansion/resync. So we delay expansion/resync for stripe in batch
list.

Signed-off-by: Shaohua Li <shli@fusionio.com>
Signed-off-by: NeilBrown <neilb@suse.de>
2015-04-22 08:00:41 +10:00
shli@kernel.org
72ac733015 raid5: handle io error of batch list
If io error happens in any stripe of a batch list, the batch list will be
split, then normal process will run for the stripes in the list.

Signed-off-by: Shaohua Li <shli@fusionio.com>
Signed-off-by: NeilBrown <neilb@suse.de>
2015-04-22 08:00:41 +10:00
shli@kernel.org
59fc630b8b RAID5: batch adjacent full stripe write
stripe cache is 4k size. Even adjacent full stripe writes are handled in 4k
unit. Idealy we should use big size for adjacent full stripe writes. Bigger
stripe cache size means less stripes runing in the state machine so can reduce
cpu overhead. And also bigger size can cause bigger IO size dispatched to under
layer disks.

With below patch, we will automatically batch adjacent full stripe write
together. Such stripes will be added to the batch list. Only the first stripe
of the list will be put to handle_list and so run handle_stripe(). Some steps
of handle_stripe() are extended to cover all stripes of the list, including
ops_run_io, ops_run_biodrain and so on. With this patch, we have less stripes
running in handle_stripe() and we send IO of whole stripe list together to
increase IO size.

Stripes added to a batch list have some limitations. A batch list can only
include full stripe write and can't cross chunk boundary to make sure stripes
have the same parity disks. Stripes in a batch list must be in the same state
(no written, toread and so on). If a stripe is in a batch list, all new
read/write to add_stripe_bio will be blocked to overlap conflict till the batch
list is handled. The limitations will make sure stripes in a batch list be in
exactly the same state in the life circly.

I did test running 160k randwrite in a RAID5 array with 32k chunk size and 6
PCIe SSD. This patch improves around 30% performance and IO size to under layer
disk is exactly 32k. I also run a 4k randwrite test in the same array to make
sure the performance isn't changed with the patch.

Signed-off-by: Shaohua Li <shli@fusionio.com>
Signed-off-by: NeilBrown <neilb@suse.de>
2015-04-22 08:00:41 +10:00
shli@kernel.org
7a87f43405 raid5: track overwrite disk count
Track overwrite disk count, so we can know if a stripe is a full stripe write.

Signed-off-by: Shaohua Li <shli@fusionio.com>
Signed-off-by: NeilBrown <neilb@suse.de>
2015-04-22 08:00:41 +10:00
shli@kernel.org
da41ba6597 raid5: add a new flag to track if a stripe can be batched
A freshly new stripe with write request can be batched. Any time the stripe is
handled or new read is queued, the flag will be cleared.

Signed-off-by: Shaohua Li <shli@fusionio.com>
Signed-off-by: NeilBrown <neilb@suse.de>
2015-04-22 08:00:41 +10:00
shli@kernel.org
46d5b78562 raid5: use flex_array for scribble data
Use flex_array for scribble data. Next patch will batch several stripes
together, so scribble data should be able to cover several stripes, so this
patch also allocates scribble data for stripes across a chunk.

Signed-off-by: Shaohua Li <shli@fusionio.com>
Signed-off-by: NeilBrown <neilb@suse.de>
2015-04-22 08:00:41 +10:00
Heinz Mauelshagen
753f2856cd md raid0: access mddev->queue (request queue member) conditionally because it is not set when accessed from dm-raid
The patch makes 3 references to mddev->queue in the raid0 personality
conditional in order to allow for it to be accessed from dm-raid.
Mandatory, because md instances underneath dm-raid don't manage
a request queue of their own which'd lead to oopses without the patch.

Signed-off-by: Heinz Mauelshagen <heinzm@redhat.com>
Tested-by: Heinz Mauelshagen <heinzm@redhat.com>
Signed-off-by: NeilBrown <neilb@suse.de>
2015-04-22 08:00:41 +10:00
NeilBrown
ac8fa4196d md: allow resync to go faster when there is competing IO.
When md notices non-sync IO happening while it is trying
to resync (or reshape or recover) it slows down to the
set minimum.

The default minimum might have made sense many years ago
but the drives have become faster.  Changing the default
to match the times isn't really a long term solution.

This patch changes the code so that instead of waiting until the speed
has dropped to the target, it just waits until pending requests
have completed.
This means that the delay inserted is a function of the speed
of the devices.

Testing shows that:
 - for some loads, the resync speed is unchanged.  For those loads
   increasing the minimum doesn't change the speed either.
   So this is a good result.  To increase resync speed under such
   loads we would probably need to increase the resync window
   size.

 - for other loads, resync speed does increase to a reasonable
   fraction (e.g. 20%) of maximum possible, and throughput of
   the load only drops a little bit (e.g. 10%)

 - for other loads, throughput of the non-sync load drops quite a bit
   more.  These seem to be latency-sensitive loads.

So it isn't a perfect solution, but it is mostly an improvement.

Signed-off-by: NeilBrown <neilb@suse.de>
2015-04-22 08:00:40 +10:00
NeilBrown
09314799e4 md: remove 'go_faster' option from ->sync_request()
This option is not well justified and testing suggests that
it hardly ever makes any difference.

The comment suggests there might be a need to wait for non-resync
activity indicated by ->nr_waiting, however raise_barrier()
already waits for all of that.

So just remove it to simplify reasoning about speed limiting.

This allows us to remove a 'FIXME' comment from raid5.c as that
never used the flag.

Signed-off-by: NeilBrown <neilb@suse.de>
2015-04-22 08:00:40 +10:00
NeilBrown
50c37b136a md: don't require sync_min to be a multiple of chunk_size.
There is really no need for sync_min to be a multiple of
chunk_size, and values read from here often aren't.
That means you cannot read a value and expect to be able
to write it back later.

So remove the chunk_size check, and round down to a multiple
of 4K, to be sure everything works with 4K-sector devices.

Signed-off-by: NeilBrown <neilb@suse.de>
2015-04-22 08:00:40 +10:00
NeilBrown
d51e4fe6d6 Merge branch 'cluster' into for-next 2015-04-22 08:00:20 +10:00
Goldwyn Rodrigues
97f6cd39da md-cluster: re-add capabilities
When "re-add" is writted to /sys/block/mdXX/md/dev-YYY/state,
the clustered md:

1. Sends RE_ADD message with the desc_nr. Nodes receiving the message
   clear the Faulty bit in their respective rdev->flags.
2. The node initiating re-add, gathers the bitmaps of all nodes
   and copies them into the local bitmap. It does not clear the bitmap
   from which it is copying.
3. Initiating node schedules a md recovery to sync the devices.

Signed-off-by: Guoqing Jiang <gqjiang@suse.com>
Signed-off-by: Goldwyn Rodrigues <rgoldwyn@suse.com>
Signed-off-by: NeilBrown <neilb@suse.de>
2015-04-22 07:59:39 +10:00
Goldwyn Rodrigues
a6da4ef85c md: re-add a failed disk
This adds the capability of re-adding a failed disk by
writing "re-add" to /sys/block/mdXX/md/dev-YYY/state.

This facilitates adding disks which have encountered a temporary
error such as a network disconnection/hiccup in an iSCSI device,
or a SAN cable disconnection which has been restored. In such
a situation, you do not need to remove and re-add the device.
Writing re-add to the failed device's state would add it again
to the array and perform the recovery of only the blocks which
were written after the device failed.

This works for generic md, and is not related to clustering. However,
this patch is to ease re-add operations listed above in clustering
environments.

Signed-off-by: Goldwyn Rodrigues <rgoldwyn@suse.com>
Signed-off-by: NeilBrown <neilb@suse.de>
2015-04-22 07:59:39 +10:00
Goldwyn Rodrigues
88bcfef7be md-cluster: remove capabilities
This adds "remove" capabilities for the clustered environment.
When a user initiates removal of a device from the array, a
REMOVE message with disk number in the array is sent to all
the nodes which kick the respective device in their own array.

This facilitates the removal of failed devices.

Signed-off-by: Goldwyn Rodrigues <rgoldwyn@suse.com>
Signed-off-by: NeilBrown <neilb@suse.de>
2015-04-22 07:59:39 +10:00
Goldwyn Rodrigues
57d051dcca md: Export and rename find_rdev_nr_rcu
This is required by the clustering module (patches to follow) to
find the device to remove or re-add.

Signed-off-by: Goldwyn Rodrigues <rgoldwyn@suse.com>
Signed-off-by: NeilBrown <neilb@suse.de>
2015-04-22 07:59:39 +10:00
Goldwyn Rodrigues
fb56dfef4e md: Export and rename kick_rdev_from_array
This export is required for clustering module in order to
co-ordinate remove/readd a rdev from all nodes.

Signed-off-by: Goldwyn Rodrigues <rgoldwyn@suse.com>
Signed-off-by: NeilBrown <neilb@suse.de>
2015-04-22 07:59:39 +10:00
Guoqing Jiang
8c58f02e24 md-cluster: correct the num for comparison
Since the node num of md-cluster is from zero, and
cinfo->slot_number represents the slot num of dlm,
no need to check for equality.

Signed-off-by: Guoqing Jiang <gqjiang@suse.com>
Signed-off-by: Goldwyn Rodrigues <rgoldwyn@suse.com>
Signed-off-by: NeilBrown <neilb@suse.de>
2015-04-22 07:58:31 +10:00
Eran Ben Elisha
fab9adfb71 net/mlx4_core: Fix reading HCA max message size in mlx4_QUERY_DEV_CAP
Currently we parse max_msg_sz from the wrong offset in QUERY_DEV_CAP,
fix to use the right offset.

Fixes: 0b131561a7 ('net/mlx4_en: Add Flow control statistics [..]')
Signed-off-by: Eran Ben Elisha <eranbe@mellanox.com>
Signed-off-by: Or Gerlitz <ogerlitz@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-04-21 17:36:08 -04:00
Bjorn Helgaas
9fbbda5c8e ia64/PCI: Treat all host bridge Address Space Descriptors (even consumers) as windows
Prior to c770cb4cb5 ("PCI: Mark invalid BARs as unassigned"), if we tried
to claim a PCI BAR but could not find an upstream bridge window that
matched it, we complained but still allowed the device to be enabled.

c770cb4cb5 broke devices that previously worked (mptsas and igb in the
case Tony reported, but it could be any devices) because it marks those
BARs as IORESOURCE_UNSET, which makes pci_enable_device() complain and
return failure:

  igb 0000:81:00.0: can't enable device: BAR 0 [mem size 0x00020000] not assigned
  igb: probe of 0000:81:00.0 failed with error -22

The underlying cause is an ACPI Address Space Descriptor for a PCI host
bridge window that is marked as "consumer".  This is a firmware defect:
resources that are produced on the downstream side of a bridge should be
marked "producer".  But rejecting these BARs that we previously allowed is
a functionality regression, and firmware has not used the producer/consumer
bit consistently, so we can't rely on it anyway.

Stop checking the producer/consumer bit, and assume all bridge Address
Space Descriptors are for bridge windows.

Note that this change does not affect I/O Port or Fixed Location I/O Port
Descriptors, which are commonly used for the [io 0x0cf8-0x0cff] config
access range.  That range is a "consumer" range and should not be treated
as a window.

Fixes: c770cb4cb5 ("PCI: Mark invalid BARs as unassigned")
Link: https://bugzilla.kernel.org/show_bug.cgi?id=96961
Reported-and-tested-by: Tony Luck <tony.luck@intel.com>
Signed-off-by: Bjorn Helgaas <bhelgaas@google.com>
Acked-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
2015-04-21 15:29:39 -05:00
Sowmini Varadhan
0edfad5959 sparc: Use GFP_ATOMIC in ldc_alloc_exp_dring() as it can be called in softirq context
Since it is possible for vnet_event_napi to end up doing
vnet_control_pkt_engine -> ... -> vnet_send_attr ->
vnet_port_alloc_tx_ring -> ldc_alloc_exp_dring -> kzalloc()
(i.e., in softirq context), kzalloc() should be called with
GFP_ATOMIC from ldc_alloc_exp_dring.

Signed-off-by: Sowmini Varadhan <sowmini.varadhan@oracle.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-04-21 13:17:06 -07:00
Andreas Gruenbacher
bff175238a uapi: Remove kernel internal declaration
The enum nfs4_acl_whotype is only used in nfs4d's internal nfs4 acl
representation. No longer expose it to user space.

Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
2015-04-21 16:16:04 -04:00
Giuseppe Cantavenera
bb7ffbf29e nfsd: fix nsfd startup race triggering BUG_ON
nfsd triggered a BUG_ON in net_generic(...) when rpc_pipefs_event(...)
in fs/nfsd/nfs4recover.c was called before assigning ntfsd_net_id.
The following was observed on a MIPS 32-core processor:
kernel: Call Trace:
kernel: [<ffffffffc00bc5e4>] rpc_pipefs_event+0x7c/0x158 [nfsd]
kernel: [<ffffffff8017a2a0>] notifier_call_chain+0x70/0xb8
kernel: [<ffffffff8017a4e4>] __blocking_notifier_call_chain+0x4c/0x70
kernel: [<ffffffff8053aff8>] rpc_fill_super+0xf8/0x1a0
kernel: [<ffffffff8022204c>] mount_ns+0xb4/0xf0
kernel: [<ffffffff80222b48>] mount_fs+0x50/0x1f8
kernel: [<ffffffff8023dc00>] vfs_kern_mount+0x58/0xf0
kernel: [<ffffffff802404ac>] do_mount+0x27c/0xa28
kernel: [<ffffffff80240cf0>] SyS_mount+0x98/0xe8
kernel: [<ffffffff80135d24>] handle_sys64+0x44/0x68
kernel:
kernel:
        Code: 0040f809  00000000  2e020001 <00020336> 3c12c00d
                3c02801a  de100000 6442eb98  0040f809
kernel: ---[ end trace 7471374335809536 ]---

Fixed this behaviour by calling register_pernet_subsys(&nfsd_net_ops) before
registering rpc_pipefs_event(...) with the notifier chain.

Signed-off-by: Giuseppe Cantavenera <giuseppe.cantavenera.ext@nokia.com>
Signed-off-by: Lorenzo Restelli <lorenzo.restelli.ext@nokia.com>
Reviewed-by: Kinlong Mee <kinglongmee@gmail.com>
Cc: stable@vger.kernel.org
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
2015-04-21 16:16:03 -04:00
Mark Salter
135dd002c2 nfsd: eliminate NFSD_DEBUG
Commit f895b252d4 ("sunrpc: eliminate RPC_DEBUG") introduced
use of IS_ENABLED() in a uapi header which leads to a build
failure for userspace apps trying to use <linux/nfsd/debug.h>:

   linux/nfsd/debug.h:18:15: error: missing binary operator before token "("
  #if IS_ENABLED(CONFIG_SUNRPC_DEBUG)
                ^

Since this was only used to define NFSD_DEBUG if CONFIG_SUNRPC_DEBUG
is enabled, replace instances of NFSD_DEBUG with CONFIG_SUNRPC_DEBUG.

Cc: stable@vger.kernel.org
Fixes: f895b252d4 "sunrpc: eliminate RPC_DEBUG"
Signed-off-by: Mark Salter <msalter@redhat.com>
Reviewed-by: Jeff Layton <jlayton@primarydata.com>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
2015-04-21 16:16:02 -04:00
J. Bruce Fields
6e4891dc28 nfsd4: fix READ permission checking
In the case we already have a struct file (derived from a stateid), we
still need to do permission-checking; otherwise an unauthorized user
could gain access to a file by sniffing or guessing somebody else's
stateid.

Cc: stable@vger.kernel.org
Fixes: dc97618ddd "nfsd4: separate splice and readv cases"
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
2015-04-21 16:16:01 -04:00
J. Bruce Fields
980608fb50 nfsd4: disallow SEEK with special stateids
If the client uses a special stateid then we'll pass a NULL file to
vfs_llseek.

Fixes: 24bab49122 " NFSD: Implement SEEK"
Cc: Anna Schumaker <Anna.Schumaker@Netapp.com>
Cc: stable@vger.kernel.org
Reported-by: Christoph Hellwig <hch@infradead.org>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
2015-04-21 16:16:01 -04:00
David S. Miller
df386375ff sparc64: Use M7 PMC write on all chips T4 and onward.
They both work equally well, and the M7 implementation is
simpler and cheaper (less register writes).

With help from David Ahern.

Signed-off-by: David S. Miller <davem@davemloft.net>
2015-04-21 13:14:53 -07:00
Bjorn Helgaas
af539e985b MAINTAINERS: Remove Mohit Kumar (email bounces)
Email to Mohit Kumar <mohit.kumar@st.com> has been bouncing, so remove the
address from MAINTAINERS and add an entry in CREDITS.

Signed-off-by: Bjorn Helgaas <bhelgaas@google.com>
2015-04-21 15:12:22 -05:00
Guenter Roeck
c19edb6946 parisc: Replace PT_NLEVELS with CONFIG_PGTABLE_LEVELS
The following warning is seen when compiling parisc images

./arch/parisc/include/asm/pgalloc.h: In function 'pgd_alloc':
./arch/parisc/include/asm/pgalloc.h:29:5: warning: "PT_NLEVELS" is not defined

Some definitions of PT_NLEVELS were missed with the conversion to
CONFIG_PGTABLE_LEVELS.

Fixes: f24ffde432 ("parisc: expose number of page table levels
	on Kconfig level")
Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Signed-off-by: Guenter Roeck <linux@roeck-us.net>
Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Signed-off-by: Helge Deller <deller@gmx.de>
2015-04-21 22:04:03 +02:00