Commit graph

39705 commits

Author SHA1 Message Date
Peng Tao
39280a5ae8 nfs41: move file layout macros to generic pnfs
They can be reused by flexfile layout as well.

Also add a code such that if read fails on one DS and
there are other DSes available to use, don't resend
through MDS but through pNFS so that client can read
from other DSes.

Reviewed-by: Jeff Layton <jlayton@primarydata.com>
Signed-off-by: Peng Tao <tao.peng@primarydata.com>
Signed-off-by: Tom Haynes <Thomas.Haynes@primarydata.com>
2015-02-03 11:06:33 -08:00
Peng Tao
064172f345 nfs41: allow LD to choose DS connection auth flavor
flexfile layout may use different auth flavor as specified by MDS.

Reviewed-by: Jeff Layton <jlayton@primarydata.com>
Signed-off-by: Peng Tao <tao.peng@primarydata.com>
Signed-off-by: Tom Haynes <Thomas.Haynes@primarydata.com>
2015-02-03 11:06:33 -08:00
Peng Tao
7405f9e195 nfs41: pull nfs4_ds_connect from file layout to generic pnfs
It can be reused by flexfiles layout client.

Reviewed-by: Jeff Layton <jlayton@primarydata.com>
Signed-off-by: Peng Tao <tao.peng@primarydata.com>
Signed-off-by: Tom Haynes <Thomas.Haynes@primarydata.com>
2015-02-03 11:06:32 -08:00
Peng Tao
6b7f3cf963 nfs41: pull decode_ds_addr from file layout to generic pnfs
It can be reused by flexfile layout.

Reviewed-by: Jeff Layton <jlayton@primarydata.com>
Signed-off-by: Peng Tao <tao.peng@primarydata.com>
Signed-off-by: Tom Haynes <Thomas.Haynes@primarydata.com>
2015-02-03 11:06:32 -08:00
Peng Tao
875ae0694b nfs41: pull data server cache from file layout to generic pnfs
Also pull nfs4_pnfs_ds_addr and nfs4_pnfs_ds to generic pnfs.

They can all be reused by flexfile layout as well.

Reviewed-by: Jeff Layton <jlayton@primarydata.com>
Signed-off-by: Peng Tao <tao.peng@primarydata.com>
Signed-off-by: Tom Haynes <Thomas.Haynes@primarydata.com>
2015-02-03 11:06:32 -08:00
Tom Haynes
085d1e33a6 pnfs: Do not grab the commit_info lock twice when rescheduling writes
Acked-by: Jeff Layton <jlayton@primarydata.com>
Signed-off-by: Tom Haynes <loghyr@primarydata.com>
2015-02-03 11:06:31 -08:00
Tom Haynes
f54bcf2ece pnfs: Prepare for flexfiles by pulling out common code
The flexfilelayout driver will share some common code
with the filelayout driver. This set of changes refactors
that common code out to avoid any module depenencies.

Signed-off-by: Tom Haynes <loghyr@primarydata.com>
2015-02-03 11:06:31 -08:00
Gui Hecheng
b76808fc26 btrfs: cleanup init for list in free-space-cache
o removed an unecessary INIT_LIST_HEAD after LIST_HEAD

o merge a declare & INIT_LIST_HEAD pair into one LIST_HEAD

Signed-off-by: Gui Hecheng <guihc.fnst@cn.fujitsu.com>
Reviewed-by: David Sterba <dsterba@suse.cz>
Signed-off-by: Chris Mason <clm@fb.com>
2015-02-02 19:25:50 -08:00
Shaohua Li
2f0810880f btrfs: delete chunk allocation attemp when setting block group ro
Below test will fail currently:
      mkfs.ext4 -F /dev/sda
      btrfs-convert /dev/sda
      mount /dev/sda /mnt
      btrfs device add -f /dev/sdb /mnt
      btrfs balance start -v -dconvert=raid1 -mconvert=raid1 /mnt

The reason is there are some block groups with usage 0, but the whole
disk hasn't free space to allocate new chunk, so we even can't set such
block group readonly. This patch deletes the chunk allocation when
setting block group ro. For META, we already have reserve. But for
SYSTEM, we don't have, so the check_system_chunk is still required.

Signed-off-by: Shaohua Li <shli@fb.com>
Signed-off-by: Chris Mason <clm@fb.com>
2015-02-02 19:25:20 -08:00
Naohiro Aota
289454ad26 btrfs: clear bio reference after submit_one_bio()
After submit_one_bio(), `bio' can go away. However submit_extent_page()
leave `bio' referable if submit_one_bio() failed (e.g. -ENOMEM on OOM).
It will cause invalid paging request when submit_extent_page() is called
next time.

I reproduced ENOMEM case with the following script (need
CONFIG_FAIL_PAGE_ALLOC, and CONFIG_FAULT_INJECTION_DEBUG_FS).

  #!/bin/bash

  dmesgout=dmesg.txt
  start=100000
  end=300000
  step=1000

  # btrfs options
  device=/dev/vdb1
  directory=/mnt/btrfs

  # fault-injection options
  percent=100
  times=3

  mkdir -p $directory || exit 1
  mount -o compress $device $directory || exit 1

  rm -f $directory/file || exit 1
  dd if=/dev/zero of=$directory/file bs=1M count=512 || exit 1

  for interval in `seq $start $step $end`; do
          dmesg -C
          echo 1 > /proc/sys/vm/drop_caches
          sync
          export FAILCMD_TYPE=fail_page_alloc
          ./failcmd.sh -p $percent -t $times -i $interval \
                  --ignore-gfp-highmem=N --ignore-gfp-wait=N --min-order=0 \
                  -- \
                  cat $directory/file > /dev/null
          dmesg > ${dmesgout}
          if grep -q BUG: ${dmesgout}; then
                  cat ${dmesgout}
                  exit 1
          fi
  done

  umount $directory
  exit 0

Signed-off-by: Naohiro Aota <naota@elisp.net>
Tested-by: Satoru Takeuchi <takeuchi_satoru@jp.fujitsu.com>
Signed-off-by: Chris Mason <clm@fb.com>
2015-02-02 19:24:51 -08:00
Filipe Manana
de554a4fa6 Btrfs: fix scrub race leading to use-after-free
While running a scrub on a kernel with CONFIG_DEBUG_PAGEALLOC=y, I got
the following trace:

[68127.807663] BUG: unable to handle kernel paging request at ffff8803f8947a50
[68127.807663] IP: [<ffffffff8107da31>] do_raw_spin_lock+0x94/0x122
[68127.807663] PGD 3003067 PUD 43e1f5067 PMD 43e030067 PTE 80000003f8947060
[68127.807663] Oops: 0000 [#1] SMP DEBUG_PAGEALLOC
[68127.807663] Modules linked in: dm_flakey dm_mod crc32c_generic btrfs xor raid6_pq nfsd auth_rpcgss oid_registry nfs_acl nfs lockd grace fscache sunrpc loop parport_pc processor parpo
[68127.807663] CPU: 2 PID: 3081 Comm: kworker/u8:5 Not tainted 3.18.0-rc6-btrfs-next-3+ #4
[68127.807663] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.7.5-0-ge51488c-20140602_164612-nilsson.home.kraxel.org 04/01/2014
[68127.807663] Workqueue: btrfs-btrfs-scrub btrfs_scrub_helper [btrfs]
[68127.807663] task: ffff880101fc5250 ti: ffff8803f097c000 task.ti: ffff8803f097c000
[68127.807663] RIP: 0010:[<ffffffff8107da31>]  [<ffffffff8107da31>] do_raw_spin_lock+0x94/0x122
[68127.807663] RSP: 0018:ffff8803f097fbb8  EFLAGS: 00010093
[68127.807663] RAX: 0000000028dd386c RBX: ffff8803f8947a50 RCX: 0000000028dd3854
[68127.807663] RDX: 0000000000000018 RSI: 0000000000000002 RDI: 0000000000000001
[68127.807663] RBP: ffff8803f097fbd8 R08: 0000000000000004 R09: 0000000000000001
[68127.807663] R10: ffff880102620980 R11: ffff8801f3e8c900 R12: 000000000001d390
[68127.807663] R13: 00000000cabd13c8 R14: ffff8803f8947800 R15: ffff88037c574f00
[68127.807663] FS:  0000000000000000(0000) GS:ffff88043dd00000(0000) knlGS:0000000000000000
[68127.807663] CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b
[68127.807663] CR2: ffff8803f8947a50 CR3: 00000000b6481000 CR4: 00000000000006e0
[68127.807663] Stack:
[68127.807663]  ffffffff823942a8 ffff8803f8947a50 ffff8802a3416f80 0000000000000000
[68127.807663]  ffff8803f097fc18 ffffffff8141e7c0 ffffffff81072948 000000000034f314
[68127.807663]  ffff8803f097fc08 0000000000000292 ffff8803f097fc48 ffff8803f8947a50
[68127.807663] Call Trace:
[68127.807663]  [<ffffffff8141e7c0>] _raw_spin_lock_irqsave+0x4b/0x55
[68127.807663]  [<ffffffff81072948>] ? __wake_up+0x22/0x4b
[68127.807663]  [<ffffffff81072948>] __wake_up+0x22/0x4b
[68127.807663]  [<ffffffffa0392327>] scrub_pending_bio_dec+0x32/0x36 [btrfs]
[68127.807663]  [<ffffffffa0395e70>] scrub_bio_end_io_worker+0x5a3/0x5c9 [btrfs]
[68127.807663]  [<ffffffff810e0c7c>] ? time_hardirqs_off+0x15/0x28
[68127.807663]  [<ffffffff81078106>] ? trace_hardirqs_off_caller+0x4c/0xb9
[68127.807663]  [<ffffffffa0372a7c>] normal_work_helper+0xf1/0x238 [btrfs]
[68127.807663]  [<ffffffffa0372d3d>] btrfs_scrub_helper+0x12/0x14 [btrfs]
[68127.807663]  [<ffffffff810582d2>] process_one_work+0x1e4/0x3b6
[68127.807663]  [<ffffffff81078180>] ? trace_hardirqs_off+0xd/0xf
[68127.807663]  [<ffffffff81058dc9>] worker_thread+0x1fb/0x2a8
[68127.807663]  [<ffffffff81058bce>] ? rescuer_thread+0x219/0x219
[68127.807663]  [<ffffffff8105cd75>] kthread+0xdb/0xe3
[68127.807663]  [<ffffffff8105cc9a>] ? __kthread_parkme+0x67/0x67
[68127.807663]  [<ffffffff8141f1ec>] ret_from_fork+0x7c/0xb0
[68127.807663]  [<ffffffff8105cc9a>] ? __kthread_parkme+0x67/0x67
[68127.807663] Code: 39 c2 75 14 8d 8a 00 00 01 00 89 d0 f0 0f b1 0b 39 d0 0f 84 81 00 00 00 4c 69 2d 27 86 99 00 fa 00 00 00 45 31 e4 4d 39 ec 74 2b <8b> 13 89 d0 c1 e8 10 66 39 c2 75
[68127.807663] RIP  [<ffffffff8107da31>] do_raw_spin_lock+0x94/0x122
[68127.807663]  RSP <ffff8803f097fbb8>
[68127.807663] CR2: ffff8803f8947a50
[68127.807663] ---[ end trace d7045aac00a66cd8 ]---

This is due to a race that can happen in a very tiny time window and is
illustrated by the following sequence diagram:

         CPU 1                                                     CPU 2

                                                                btrfs_scrub_dev()
scrub_bio_end_io_worker()
   scrub_pending_bio_dec()
       atomic_dec(&sctx->bios_in_flight)
                                                                   wait sctx->bios_in_flight == 0
                                                                   wait sctx->workers_pending == 0
                                                                   mutex_lock(&fs_info->scrub_lock)
                                                                   (...)
                                                                   mutex_lock(&fs_info->scrub_lock)
                                                                   scrub_free_ctx(sctx)
                                                                      kfree(sctx)
       wake_up(&sctx->list_wait)
          __wake_up()
              spin_lock_irqsave(&sctx->list_wait->lock, flags)

Another variation of this scenario that results in the same use-after-free
issue is:

         CPU 1                                                     CPU 2

                                                                btrfs_scrub_dev()
                                                                   wait sctx->bios_in_flight == 0
scrub_bio_end_io_worker()
   scrub_pending_bio_dec()
       __wake_up(&sctx->list_wait)
          spin_lock_irqsave(&sctx->list_wait->lock, flags)
          default_wake_function()
              wake up task at CPU 2
                                                                   wait sctx->workers_pending == 0
                                                                   mutex_lock(&fs_info->scrub_lock)
                                                                   (...)
                                                                   mutex_lock(&fs_info->scrub_lock)
                                                                   scrub_free_ctx(sctx)
                                                                      kfree(sctx)
          spin_unlock_irqrestore(&sctx->list_wait->lock, flags)

Fix this by holding the scrub lock while doing the wakeup.

This isn't a recent regression, the issue as been around since the scrub
feature was added (2011, commit a2de733c78).

Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Chris Mason <clm@fb.com>
2015-02-02 19:24:50 -08:00
Filipe Manana
001a648df4 Btrfs: add missing cleanup on sysfs init failure
If we failed during initialization of sysfs, we weren't unregistering the
top level btrfs sysfs entry nor the debugfs stuff.
Not unregistering the top level sysfs entry makes future attempts to reload
the btrfs module impossible and the following is reported in dmesg:

[ 2246.451296] WARNING: CPU: 3 PID: 10999 at fs/sysfs/dir.c:486 sysfs_warn_dup+0x91/0xb0()
[ 2246.451298] sysfs: cannot create duplicate filename '/fs/btrfs'
[ 2246.451298] Modules linked in: btrfs(+) raid6_pq xor bnep rfcomm bluetooth binfmt_misc nfsd auth_rpcgss oid_registry nfs_acl nfs lockd fscache sunrpc parport_pc parport psmouse serio_raw pcspkr evbug i2c_piix4 e1000 floppy [last unloaded: btrfs]
[ 2246.451310] CPU: 3 PID: 10999 Comm: modprobe Tainted: G        W    3.13.0-fdm-btrfs-next-24+ #7
[ 2246.451311] Hardware name: Bochs Bochs, BIOS Bochs 01/01/2011
[ 2246.451312]  0000000000000009 ffff8800d353fa08 ffffffff816f1da6 0000000000000410
[ 2246.451314]  ffff8800d353fa58 ffff8800d353fa48 ffffffff8104a32c ffff88020821a290
[ 2246.451316]  ffff88020821a290 ffff88020821a290 ffff8802148f0000 ffff8800d353fb80
[ 2246.451318] Call Trace:
[ 2246.451322]  [<ffffffff816f1da6>] dump_stack+0x4e/0x68
[ 2246.451324]  [<ffffffff8104a32c>] warn_slowpath_common+0x8c/0xc0
[ 2246.451325]  [<ffffffff8104a416>] warn_slowpath_fmt+0x46/0x50
[ 2246.451328]  [<ffffffff81367dc5>] ? strlcat+0x65/0x90
(....)

This fixes the following change:

    btrfs: add simple debugfs interface
    commit 1bae30982b

Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Chris Mason <clm@fb.com>
2015-02-02 19:24:49 -08:00
Filipe Manana
d4b450cd4b Btrfs: fix race between transaction commit and empty block group removal
Committing a transaction can race with automatic removal of empty block
groups (cleaner kthread), leading to a BUG_ON() in the transaction
commit code while running btrfs_finish_extent_commit(). The following
sequence diagram shows how it can happen:

           CPU 1                                       CPU 2

btrfs_commit_transaction()
  fs_info->running_transaction = NULL
  btrfs_finish_extent_commit()
    find_first_extent_bit()
      -> found range for block group X
         in fs_info->freed_extents[]

                                               btrfs_delete_unused_bgs()
                                                 -> found block group X

                                                 Removed block group X's range
                                                 from fs_info->freed_extents[]

                                                 btrfs_remove_chunk()
                                                    btrfs_remove_block_group(bg X)

    unpin_extent_range(bg X range)
       btrfs_lookup_block_group(bg X)
          -> returns NULL
            -> BUG_ON()

The trace that results from the BUG_ON() is:

[48665.187808] ------------[ cut here ]------------
[48665.188032] kernel BUG at fs/btrfs/extent-tree.c:5675!
[48665.188032] invalid opcode: 0000 [#1] SMP DEBUG_PAGEALLOC
[48665.188032] Modules linked in: dm_flakey dm_mod crc32c_generic btrfs xor raid6_pq nfsd auth_rpcgss oid_registry nfs_acl nfs lockd grace fscache sunrpc loop parport_pc evdev microcode
[48665.197388] CPU: 2 PID: 31211 Comm: kworker/u32:16 Tainted: G        W      3.19.0-rc5-btrfs-next-4+ #1
[48665.197388] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.7.5-0-ge51488c-20140602_164612-nilsson.home.kraxel.org 04/01/2014
[48665.197388] Workqueue: events_unbound btrfs_async_reclaim_metadata_space [btrfs]
[48665.197388] task: ffff880222011810 ti: ffff8801b56a4000 task.ti: ffff8801b56a4000
[48665.197388] RIP: 0010:[<ffffffffa0350d05>]  [<ffffffffa0350d05>] unpin_extent_range+0x6a/0x1ba [btrfs]
[48665.197388] RSP: 0018:ffff8801b56a7b88  EFLAGS: 00010246
[48665.197388] RAX: 0000000000000000 RBX: ffff8802143a6000 RCX: ffff8802220120c8
[48665.197388] RDX: 0000000000000001 RSI: 0000000000000001 RDI: ffff8800a3c140b0
[48665.197388] RBP: ffff8801b56a7bd8 R08: 0000000000000003 R09: 0000000000000000
[48665.197388] R10: 0000000000000000 R11: 000000000000bbac R12: 0000000012e8e000
[48665.197388] R13: ffff8800a3c14000 R14: 0000000000000000 R15: 0000000000000000
[48665.197388] FS:  0000000000000000(0000) GS:ffff88023ec40000(0000) knlGS:0000000000000000
[48665.197388] CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b
[48665.197388] CR2: 00007f065e42f270 CR3: 0000000206f70000 CR4: 00000000000006e0
[48665.197388] Stack:
[48665.197388]  ffff8801b56a7bd8 0000000012ea0000 01ff8800a3c14138 0000000012e9ffff
[48665.197388]  ffff880141df3dd8 ffff8802143a6000 ffff8800a3c14138 ffff880141df3df0
[48665.197388]  ffff880141df3dd8 0000000000000000 ffff8801b56a7c08 ffffffffa0354227
[48665.197388] Call Trace:
[48665.197388]  [<ffffffffa0354227>] btrfs_finish_extent_commit+0xb0/0xd9 [btrfs]
[48665.197388]  [<ffffffffa0366b4b>] btrfs_commit_transaction+0x791/0x92c [btrfs]
[48665.197388]  [<ffffffffa0352432>] flush_space+0x43d/0x452 [btrfs]
[48665.197388]  [<ffffffff814295c3>] ? _raw_spin_unlock+0x28/0x33
[48665.197388]  [<ffffffffa035255f>] btrfs_async_reclaim_metadata_space+0x118/0x164 [btrfs]
[48665.197388]  [<ffffffff81059917>] ? process_one_work+0x14b/0x3ab
[48665.197388]  [<ffffffff810599ac>] process_one_work+0x1e0/0x3ab
[48665.197388]  [<ffffffff81079fa9>] ? trace_hardirqs_off+0xd/0xf
[48665.197388]  [<ffffffff8105a55b>] worker_thread+0x210/0x2d0
[48665.197388]  [<ffffffff8105a34b>] ? rescuer_thread+0x2c3/0x2c3
[48665.197388]  [<ffffffff8105e5c0>] kthread+0xef/0xf7
[48665.197388]  [<ffffffff81429682>] ? _raw_spin_unlock_irq+0x2d/0x39
[48665.197388]  [<ffffffff8105e4d1>] ? __kthread_parkme+0xad/0xad
[48665.197388]  [<ffffffff81429dec>] ret_from_fork+0x7c/0xb0
[48665.197388]  [<ffffffff8105e4d1>] ? __kthread_parkme+0xad/0xad
[48665.197388] Code: 85 f6 74 14 49 8b 06 49 03 46 09 49 39 c4 72 1d 4c 89 f7 e8 83 ec ff ff 4c 89 e6 4c 89 ef e8 1e f1 ff ff 48 85 c0 49 89 c6 75 02 <0f> 0b 49 8b 1e 49 03 5e 09 48 8b
[48665.197388] RIP  [<ffffffffa0350d05>] unpin_extent_range+0x6a/0x1ba [btrfs]
[48665.197388]  RSP <ffff8801b56a7b88>
[48665.272246] ---[ end trace b9c6ab9957521376 ]---

Fix this by ensuring that unpining the block group's range in
btrfs_finish_extent_commit() is done in a synchronized fashion
with removing the block group's range from freed_extents[]
in btrfs_delete_unused_bgs()

This race got introduced with the change:

    Btrfs: remove empty block groups automatically
    commit 47ab2a6c68

Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Chris Mason <clm@fb.com>
2015-02-02 19:24:48 -08:00
David Sterba
e3540eab29 btrfs: add more checks to btrfs_read_sys_array
Verify that the sys_array has enough bytes to read the next item.

Signed-off-by: David Sterba <dsterba@suse.cz>
Signed-off-by: Chris Mason <clm@fb.com>
2015-02-02 19:24:39 -08:00
David Sterba
1ffb22cf8c btrfs: cleanup, rename a few variables in btrfs_read_sys_array
There's a pointer to buffer, integer offset and offset passed as
pointer, try to find matching names for them.

Signed-off-by: David Sterba <dsterba@suse.cz>
Signed-off-by: Chris Mason <clm@fb.com>
2015-02-02 19:24:28 -08:00
David Sterba
ce7fca5f57 btrfs: add checks for sys_chunk_array sizes
Verify that possible minimum and maximum size is set, validity of
contents is checked in btrfs_read_sys_array.

Signed-off-by: David Sterba <dsterba@suse.cz>
Signed-off-by: Chris Mason <clm@fb.com>
2015-02-02 19:23:43 -08:00
David Sterba
75d6ad382b btrfs: more superblock checks, lower bounds on devices and sectorsize/nodesize
I received a few crafted images from Jiri, all got through the recently
added superblock checks. The lower bounds checks for num_devices and
sector/node -sizes were missing and caused a crash during mount.

Tools for symbolic code execution were used to prepare the images
contents.

Reported-by: Jiri Slaby <jslaby@suse.cz>
Signed-off-by: David Sterba <dsterba@suse.cz>
Signed-off-by: Chris Mason <clm@fb.com>
2015-02-02 19:20:39 -08:00
chandan r
9cc97d6462 Btrfs: Add code to support file creation time
This patch adds a new member to the 'struct btrfs_inode' structure to hold
the file creation time.

Signed-off-by: chandan <chandanrmail@gmail.com>
[refreshed, removed btrfs_inode_otime]
Signed-off-by: David Sterba <dsterba@suse.cz>

Signed-off-by: Chris Mason <clm@fb.com>
2015-02-02 18:39:16 -08:00
David Sterba
a937b9791e btrfs: kill btrfs_inode_*time helpers
They just opencode taking address of the timespec member.

Signed-off-by: David Sterba <dsterba@suse.cz>
Signed-off-by: Chris Mason <clm@fb.com>
2015-02-02 18:39:07 -08:00
Markus Elfring
648695c748 jfs: Deletion of an unnecessary check before the function call "unload_nls"
The unload_nls() function tests whether its argument is NULL and then
returns immediately. Thus the test around the call is not needed.

This issue was detected by using the Coccinelle software.

Signed-off-by: Markus Elfring <elfring@users.sourceforge.net>
Signed-off-by: Dave Kleikamp <dave.kleikamp@oracle.com>
2015-02-02 15:02:34 -06:00
Christoph Hellwig
31ef83dc05 nfsd: add trace events
For now just a few simple events to trace the layout stateid lifetime, but
these already were enough to find several bugs in the Linux client layout
stateid handling.

Signed-off-by: Christoph Hellwig <hch@lst.de>
2015-02-02 18:09:44 +01:00
Christoph Hellwig
c5c707f96f nfsd: implement pNFS layout recalls
Add support to issue layout recalls to clients.  For now we only support
full-file recalls to get a simple and stable implementation.  This allows
to embedd a nfsd4_callback structure in the layout_state and thus avoid
any memory allocations under spinlocks during a recall.  For normal
use cases that do not intent to share a single file between multiple
clients this implementation is fully sufficient.

To ensure layouts are recalled on local filesystem access each layout
state registers a new FL_LAYOUT lease with the kernel file locking code,
which filesystems that support pNFS exports that require recalls need
to break on conflicting access patterns.

The XDR code is based on the old pNFS server implementation by
Andy Adamson, Benny Halevy, Boaz Harrosh, Dean Hildebrand, Fred Isaman,
Marc Eshel, Mike Sager and Ricardo Labiaga.

Signed-off-by: Christoph Hellwig <hch@lst.de>
2015-02-02 18:09:43 +01:00
Christoph Hellwig
9cf514ccfa nfsd: implement pNFS operations
Add support for the GETDEVICEINFO, LAYOUTGET, LAYOUTCOMMIT and
LAYOUTRETURN NFSv4.1 operations, as well as backing code to manage
outstanding layouts and devices.

Layout management is very straight forward, with a nfs4_layout_stateid
structure that extends nfs4_stid to manage layout stateids as the
top-level structure.  It is linked into the nfs4_file and nfs4_client
structures like the other stateids, and contains a linked list of
layouts that hang of the stateid.  The actual layout operations are
implemented in layout drivers that are not part of this commit, but
will be added later.

The worst part of this commit is the management of the pNFS device IDs,
which suffers from a specification that is not sanely implementable due
to the fact that the device-IDs are global and not bound to an export,
and have a small enough size so that we can't store the fsid portion of
a file handle, and must never be reused.  As we still do need perform all
export authentication and validation checks on a device ID passed to
GETDEVICEINFO we are caught between a rock and a hard place.  To work
around this issue we add a new hash that maps from a 64-bit integer to a
fsid so that we can look up the export to authenticate against it,
a 32-bit integer as a generation that we can bump when changing the device,
and a currently unused 32-bit integer that could be used in the future
to handle more than a single device per export.  Entries in this hash
table are never deleted as we can't reuse the ids anyway, and would have
a severe lifetime problem anyway as Linux export structures are temporary
structures that can go away under load.

Parts of the XDR data, structures and marshaling/unmarshaling code, as
well as many concepts are derived from the old pNFS server implementation
from Andy Adamson, Benny Halevy, Dean Hildebrand, Marc Eshel, Fred Isaman,
Mike Sager, Ricardo Labiaga and many others.

Signed-off-by: Christoph Hellwig <hch@lst.de>
2015-02-02 18:09:42 +01:00
Christoph Hellwig
4d227fca1b nfsd: make find_any_file available outside nfs4state.c
Signed-off-by: Christoph Hellwig <hch@lst.de>
2015-02-02 18:09:41 +01:00
Christoph Hellwig
e6ba76e194 nfsd: make find/get/put file available outside nfs4state.c
Signed-off-by: Christoph Hellwig <hch@lst.de>
2015-02-02 18:09:41 +01:00
Christoph Hellwig
cd61c52231 nfsd: make lookup/alloc/unhash_stid available outside nfs4state.c
Signed-off-by: Christoph Hellwig <hch@lst.de>
2015-02-02 18:09:40 +01:00
Christoph Hellwig
9558f2500a nfsd: add fh_fsid_match helper
Add a helper to check that the fsid parts of two file handles match.

Signed-off-by: Christoph Hellwig <hch@lst.de>
2015-02-02 18:09:39 +01:00
Christoph Hellwig
4d94c2ef20 nfsd: move nfsd_fh_match to nfsfh.h
The pnfs code will need it too.  Also remove the nfsd_ prefix to match the
other filehandle helpers in that file.

Signed-off-by: Christoph Hellwig <hch@lst.de>
2015-02-02 18:09:39 +01:00
Christoph Hellwig
11afe9f76e fs: add FL_LAYOUT lease type
This (ab-)uses the file locking code to allow filesystems to recall
outstanding pNFS layouts on a file.  This new lease type is similar but
not quite the same as FL_DELEG.  A FL_LAYOUT lease can always be granted,
an a per-filesystem lock (XFS iolock for the initial implementation)
ensures not FL_LAYOUT leases granted when we would need to recall them.

Also included are changes that allow multiple outstanding read
leases of different types on the same file as long as they have a
differnt owner.  This wasn't a problem until now as nfsd never set
FL_LEASE leases, and no one else used FL_DELEG leases, but given that
nfsd will also issues FL_LAYOUT leases we will have to handle it now.

Signed-off-by: Christoph Hellwig <hch@lst.de>
2015-02-02 18:09:38 +01:00
Christoph Hellwig
2ab99ee124 fs: track fl_owner for leases
Just like for other lock types we should allow different owners to have
a read lease on a file.  Currently this can't happen, but with the addition
of pNFS layout leases we'll need this feature.

Signed-off-by: Christoph Hellwig <hch@lst.de>
2015-02-02 18:09:38 +01:00
Al Viro
15d0f5ea34 Make super_blocks and sb_lock static
The only user outside of fs/super.c is gone now

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Acked-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@fb.com>
2015-02-02 10:07:59 -07:00
J. Bruce Fields
a584143b01 Merge branch 'locks-3.20' of git://git.samba.org/jlayton/linux into for-3.20
Christoph's block pnfs patches have some minor dependencies on these
lock patches.
2015-02-02 11:29:29 -05:00
Dave Chinner
179073620d Merge branch 'xfs-ioctl-setattr-cleanup' into for-next 2015-02-02 10:57:30 +11:00
Iustin Pop
9b94fcc398 xfs: fix behaviour of XFS_IOC_FSSETXATTR on directories
Currently, the ioctl handling code for XFS_IOC_FSSETXATTR treats all
targets as regular files: it refuses to change the extent size if
extents are allocated. This is wrong for directories, as there the
extent size is only used as a default for children.

The patch fixes this issue and improves validation of flag
combinations:

- only disallow extent size changes after extents have been allocated
  for regular files
- only allow XFS_XFLAG_EXTSIZE for regular files
- only allow XFS_XFLAG_EXTSZINHERIT for directories
- automatically clear the flags if the extent size is zero

Thanks to Dave Chinner for guidance on the proper fix for this issue.

[dchinner: ported changes onto cleanup series. Makes changes clear
	   and obvious.]
[dchinner: added comments documenting validity checking rules.]

Signed-off-by: Iustin Pop <iustin@k1024.org>
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
2015-02-02 10:26:26 +11:00
Dave Chinner
23bd0735cf xfs: factor projid hint checking out of xfs_ioctl_setattr
The project ID change checking is one of the few remaining open
coded checks in xfs_ioctl_setattr(). Factor it into a helper
function so that the setattr code mostly becomes a flow of check
and action helpers, making it easier to read and follow.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
2015-02-02 10:22:53 +11:00
Dave Chinner
d4388d3c09 xfs: factor extsize hint checking out of xfs_ioctl_setattr
The extent size hint change checking is fairly complex, so isolate
that into it's own function. This simplifies the logic flow of the
setattr code, making it easier to read.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
2015-02-02 10:22:20 +11:00
Dave Chinner
41c145271d xfs: XFS_IOCTL_SETXATTR can run in user namespaces
Currently XFS_IOCTL_SETXATTR will fail if run in a user namespace as
it it not allowed to change project IDs. The current code, however,
also prevents any other change being made as well, so things like
extent size hints cannot be set in user namespaces. This is wrong,
so only disallow access to project IDs and related flags from inside
the init namespace.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
2015-02-02 10:17:51 +11:00
Dave Chinner
fd179b9c3b xfs: kill xfs_ioctl_setattr behaviour mask
Now there is only one caller to xfs_ioctl_setattr that uses all the
functionality of the function we can kill the behviour mask and
start cleaning up the code.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
2015-02-02 10:16:25 +11:00
Dave Chinner
f96291f6a3 xfs: disaggregate xfs_ioctl_setattr
xfs_ioctl_setxflags doesn't need all of the functionailty in
xfs_ioctl_setattr() and now we have separate helper functions that
share the checks and modifications that xfs_ioctl_setxflags
requires. Hence disaggregate it from xfs_ioctl_setattr() to allow
further work to be done on xfs_ioctl_setattr.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
2015-02-02 10:15:56 +11:00
Dave Chinner
8f3d17ab06 xfs: factor out xfs_ioctl_setattr transaciton preamble
The setup of the transaction is done after a random smattering of
checks and before another bunch of ioperations specific
validity checks. Pull all the preamble out into a helper function
that returns a transaction or error.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
2015-02-02 10:15:35 +11:00
Dave Chinner
29a17c00d4 xfs: separate xflags from xfs_ioctl_setattr
The setting of the extended flags is down through two separate
interfaces, but they are munged together into xfs_ioctl_setattr
and make that function far more complex than it needs to be.
Separate it out into a helper function along with all the other
common inode changes and transaction manipulations in
xfs_ioctl_setattr().

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
2015-02-02 10:14:25 +11:00
Dave Chinner
817b6c480e xfs: FSX_NONBLOCK is not used
It is set if the filp is set ot non-blocking, but the flag is not
used anywhere. Hence we can kill it.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
2015-02-02 10:14:04 +11:00
Dave Chinner
3fd1b0d158 Merge branch 'xfs-misc-fixes-for-3.20-3' into for-next 2015-02-02 10:03:18 +11:00
Christoph Hellwig
2ba6623702 xfs: don't allocate an ioend for direct I/O completions
Back in the days when the direct I/O ->end_io callback could be called
from interrupt context for AIO we needed a structure to hand off to the
workqueue, and reused the ioend structure for this purpose.  These days
->end_io is always called from user or workqueue context, which allows us
to avoid this memory allocation and simplify the code significantly.

[dchinner: removed now unused xfs_finish_ioend_sync() function after
	   Brian Foster did an initial review. ]

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
2015-02-02 10:02:09 +11:00
Wang, Yalin
f3d215526e xfs: change kmem_free to use generic kvfree()
Change kmem_free to use kvfree() generic function, remove the
duplicated code.

Signed-off-by: Yalin Wang <yalin.wang@sonymobile.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
2015-02-02 09:54:18 +11:00
Christoph Hellwig
8add71ca3f xfs: factor out a xfs_update_prealloc_flags() helper
This logic is duplicated in xfs_file_fallocate and xfs_ioc_space, and
we'll need another copy of it for pNFS block support.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
2015-02-02 09:53:56 +11:00
Dan Carpenter
c7c545d4a3 NFS: a couple off by ones
These tests are off by one because if len == sizeof(nfs_export_path)
then we have truncated the name.

Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com>
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
2015-01-30 20:43:30 -05:00
Omar Sandoval
3a7ed3fff3 nfs: prevent truncate on active swapfile
Most filesystems prevent truncation of an active swapfile by way of
inode_newsize_ok, called from inode_change_ok. NFS doesn't call either
from nfs_setattr, presumably because most of these checks are expected
to be done server-side. However, the IS_SWAPFILE check can only be done
client-side, and truncating a swapfile can't possibly be good.

Signed-off-by: Omar Sandoval <osandov@osandov.com>
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
2015-01-30 20:43:29 -05:00
Jeff Layton
6ffa30d3f7 nfs: don't call blocking operations while !TASK_RUNNING
Bruce reported seeing this warning pop when mounting using v4.1:

     ------------[ cut here ]------------
     WARNING: CPU: 1 PID: 1121 at kernel/sched/core.c:7300 __might_sleep+0xbd/0xd0()
    do not call blocking ops when !TASK_RUNNING; state=1 set at [<ffffffff810ff58f>] prepare_to_wait+0x2f/0x90
    Modules linked in: rpcsec_gss_krb5 auth_rpcgss nfsv4 dns_resolver nfs lockd grace sunrpc fscache ip6t_rpfilter ip6t_REJECT nf_reject_ipv6 xt_conntrack ebtable_nat ebtable_broute bridge stp llc ebtable_filter ebtables ip6table_nat nf_conntrack_ipv6 nf_defrag_ipv6 nf_nat_ipv6 ip6table_mangle ip6table_security ip6table_raw ip6table_filter ip6_tables iptable_nat nf_conntrack_ipv4 nf_defrag_ipv4 nf_nat_ipv4 nf_nat nf_conntrack iptable_mangle iptable_security iptable_raw snd_hda_codec_generic snd_hda_intel snd_hda_controller snd_hda_codec snd_hwdep snd_pcm snd_timer ppdev joydev snd virtio_console virtio_balloon pcspkr serio_raw parport_pc parport pvpanic floppy soundcore i2c_piix4 virtio_blk virtio_net qxl drm_kms_helper ttm drm virtio_pci virtio_ring ata_generic virtio pata_acpi
    CPU: 1 PID: 1121 Comm: nfsv4.1-svc Not tainted 3.19.0-rc4+ #25
    Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.7.5-20140709_153950- 04/01/2014
     0000000000000000 000000004e5e3f73 ffff8800b998fb48 ffffffff8186ac78
     0000000000000000 ffff8800b998fba0 ffff8800b998fb88 ffffffff810ac9da
     ffff8800b998fb68 ffffffff81c923e7 00000000000004d9 0000000000000000
    Call Trace:
     [<ffffffff8186ac78>] dump_stack+0x4c/0x65
     [<ffffffff810ac9da>] warn_slowpath_common+0x8a/0xc0
     [<ffffffff810aca65>] warn_slowpath_fmt+0x55/0x70
     [<ffffffff810ff58f>] ? prepare_to_wait+0x2f/0x90
     [<ffffffff810ff58f>] ? prepare_to_wait+0x2f/0x90
     [<ffffffff810dd2ad>] __might_sleep+0xbd/0xd0
     [<ffffffff8124c973>] kmem_cache_alloc_trace+0x243/0x430
     [<ffffffff810d941e>] ? groups_alloc+0x3e/0x130
     [<ffffffff810d941e>] groups_alloc+0x3e/0x130
     [<ffffffffa0301b1e>] svcauth_unix_accept+0x16e/0x290 [sunrpc]
     [<ffffffffa0300571>] svc_authenticate+0xe1/0xf0 [sunrpc]
     [<ffffffffa02fc564>] svc_process_common+0x244/0x6a0 [sunrpc]
     [<ffffffffa02fd044>] bc_svc_process+0x1c4/0x260 [sunrpc]
     [<ffffffffa03d5478>] nfs41_callback_svc+0x128/0x1f0 [nfsv4]
     [<ffffffff810ff970>] ? wait_woken+0xc0/0xc0
     [<ffffffffa03d5350>] ? nfs4_callback_svc+0x60/0x60 [nfsv4]
     [<ffffffff810d45bf>] kthread+0x11f/0x140
     [<ffffffff810ea815>] ? local_clock+0x15/0x30
     [<ffffffff810d44a0>] ? kthread_create_on_node+0x250/0x250
     [<ffffffff81874bfc>] ret_from_fork+0x7c/0xb0
     [<ffffffff810d44a0>] ? kthread_create_on_node+0x250/0x250
    ---[ end trace 675220a11e30f4f2 ]---

nfs41_callback_svc does most of its work while in TASK_INTERRUPTIBLE,
which is just wrong. Fix that by finishing the wait immediately if we've
found that the list has something on it.

Also, we don't expect this kthread to accept signals, so we should be
using a TASK_UNINTERRUPTIBLE sleep instead. That however, opens us up
hung task warnings from the watchdog, so have the schedule_timeout
wake up every 60s if there's no callback activity.

Reported-by: "J. Bruce Fields" <bfields@fieldses.org>
Signed-off-by: Jeff Layton <jlayton@primarydata.com>
Cc: stable@vger.kernel.org
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
2015-01-30 20:39:50 -05:00
Linus Torvalds
bc208e0ee0 Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs
Pull btrfs fix from Chris Mason:
 "We have one more fix for btrfs in my for-linus branch - this was a bug
  in the new raid5/6 scrubbing support"

* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs:
  btrfs: fix raid56 scrub failed in xfstests btrfs/072
2015-01-30 14:25:52 -08:00