evie/android_kernel_oneplus_msm8998 - Gay Catgirls Forgejo: gay catgirls having sex

evie/android_kernel_oneplus_msm8998

Author	SHA1	Message	Date
Liu Bo	2a85d9cac1	Btrfs: fix possible deadlock in btrfs_cleanup_transaction [13654.480669] ====================================================== [13654.480905] [ INFO: possible circular locking dependency detected ] [13654.481003] 3.12.0+ #4 Tainted: G W O [13654.481060] ------------------------------------------------------- [13654.481060] btrfs-transacti/9347 is trying to acquire lock: [13654.481060] (&(&root->ordered_extent_lock)->rlock){+.+...}, at: [<ffffffffa02d30a1>] btrfs_cleanup_transaction+0x271/0x570 [btrfs] [13654.481060] but task is already holding lock: [13654.481060] (&(&fs_info->ordered_root_lock)->rlock){+.+...}, at: [<ffffffffa02d3015>] btrfs_cleanup_transaction+0x1e5/0x570 [btrfs] [13654.481060] which lock already depends on the new lock. [13654.481060] the existing dependency chain (in reverse order) is: [13654.481060] -> #1 (&(&fs_info->ordered_root_lock)->rlock){+.+...}: [13654.481060] [<ffffffff810c4103>] lock_acquire+0x93/0x130 [13654.481060] [<ffffffff81689991>] _raw_spin_lock+0x41/0x50 [13654.481060] [<ffffffffa02f011b>] __btrfs_add_ordered_extent+0x39b/0x450 [btrfs] [13654.481060] [<ffffffffa02f0202>] btrfs_add_ordered_extent+0x32/0x40 [btrfs] [13654.481060] [<ffffffffa02df6aa>] run_delalloc_nocow+0x78a/0x9d0 [btrfs] [13654.481060] [<ffffffffa02dfc0d>] run_delalloc_range+0x31d/0x390 [btrfs] [13654.481060] [<ffffffffa02f7c00>] __extent_writepage+0x310/0x780 [btrfs] [13654.481060] [<ffffffffa02f830a>] extent_write_cache_pages.isra.29.constprop.48+0x29a/0x410 [btrfs] [13654.481060] [<ffffffffa02f879d>] extent_writepages+0x4d/0x70 [btrfs] [13654.481060] [<ffffffffa02d9f68>] btrfs_writepages+0x28/0x30 [btrfs] [13654.481060] [<ffffffff8114be91>] do_writepages+0x21/0x50 [13654.481060] [<ffffffff81140d49>] __filemap_fdatawrite_range+0x59/0x60 [13654.481060] [<ffffffff81140e13>] filemap_fdatawrite_range+0x13/0x20 [13654.481060] [<ffffffffa02f1db9>] btrfs_wait_ordered_range+0x49/0x140 [btrfs] [13654.481060] [<ffffffffa0318fe2>] __btrfs_write_out_cache+0x682/0x8b0 [btrfs] [13654.481060] [<ffffffffa031952d>] btrfs_write_out_cache+0x8d/0xe0 [btrfs] [13654.481060] [<ffffffffa02c7083>] btrfs_write_dirty_block_groups+0x593/0x680 [btrfs] [13654.481060] [<ffffffffa0345307>] commit_cowonly_roots+0x14b/0x20d [btrfs] [13654.481060] [<ffffffffa02d7c1a>] btrfs_commit_transaction+0x43a/0x9d0 [btrfs] [13654.481060] [<ffffffffa030061a>] btrfs_create_uuid_tree+0x5a/0x100 [btrfs] [13654.481060] [<ffffffffa02d5a8a>] open_ctree+0x21da/0x2210 [btrfs] [13654.481060] [<ffffffffa02ab6fe>] btrfs_mount+0x68e/0x870 [btrfs] [13654.481060] [<ffffffff811b2409>] mount_fs+0x39/0x1b0 [13654.481060] [<ffffffff811cd653>] vfs_kern_mount+0x63/0xf0 [13654.481060] [<ffffffff811cfcce>] do_mount+0x23e/0xa90 [13654.481060] [<ffffffff811d05a3>] SyS_mount+0x83/0xc0 [13654.481060] [<ffffffff81692b52>] system_call_fastpath+0x16/0x1b [13654.481060] -> #0 (&(&root->ordered_extent_lock)->rlock){+.+...}: [13654.481060] [<ffffffff810c340a>] __lock_acquire+0x150a/0x1a70 [13654.481060] [<ffffffff810c4103>] lock_acquire+0x93/0x130 [13654.481060] [<ffffffff81689991>] _raw_spin_lock+0x41/0x50 [13654.481060] [<ffffffffa02d30a1>] btrfs_cleanup_transaction+0x271/0x570 [btrfs] [13654.481060] [<ffffffffa02d35ce>] transaction_kthread+0x22e/0x270 [btrfs] [13654.481060] [<ffffffff81079efa>] kthread+0xea/0xf0 [13654.481060] [<ffffffff81692aac>] ret_from_fork+0x7c/0xb0 [13654.481060] other info that might help us debug this: [13654.481060] Possible unsafe locking scenario: [13654.481060] CPU0 CPU1 [13654.481060] ---- ---- [13654.481060] lock(&(&fs_info->ordered_root_lock)->rlock); [13654.481060] lock(&(&root->ordered_extent_lock)->rlock); [13654.481060] lock(&(&fs_info->ordered_root_lock)->rlock); [13654.481060] lock(&(&root->ordered_extent_lock)->rlock); [13654.481060] * DEADLOCK * [...] ====================================================== btrfs_destroy_all_ordered_extents() gets &fs_info->ordered_root_lock __BEFORE__ acquiring &root->ordered_extent_lock, while btrfs_[add,remove]_ordered_extent() acquires &fs_info->ordered_root_lock __AFTER__ getting &root->ordered_extent_lock. This patch fixes the above problem. Signed-off-by: Liu Bo <bo.li.liu@oracle.com> Signed-off-by: Josef Bacik <jbacik@fb.com>	2014-03-10 15:16:37 -04:00
Filipe David Borba Manana	d5f375270a	Btrfs: faster/more efficient insertion of file extent items This is an extension to my previous commit titled: "Btrfs: faster file extent item replace operations" (hash `1acae57b16`) Instead of inserting the new file extent item if we deleted existing file extent items covering our target file range, also allow to insert the new file extent item if we didn't find any existing items to delete and replace_extent != 0, since in this case our caller would do another tree search to insert the new file extent item anyway, therefore just combine the two tree searches into a single one, saving cpu time, reducing lock contention and reducing btree node/leaf COW operations. This covers the case where applications keep doing tail append writes to files, which for example is the case of Apache CouchDB (its database and view index files are always open with O_APPEND). Signed-off-by: Filipe David Borba Manana <fdmanana@gmail.com> Signed-off-by: Josef Bacik <jbacik@fb.com>	2014-03-10 15:16:37 -04:00
Stanislaw Gruszka	51b98effa4	btrfs: always choose work from prio_head first In case we do not refill, we can overwrite cur pointer from prio_head by one from not prioritized head, what looks as something that was not intended. This change make we always take works from prio_head first until it's not empty. Signed-off-by: Stanislaw Gruszka <stf_xl@wp.pl> Signed-off-by: Josef Bacik <jbacik@fb.com>	2014-03-10 15:16:36 -04:00
Wang Shilong	dcfd5ad2fc	Revert "Btrfs: remove transaction from btrfs send" This reverts commit `41ce9970a8`. Previously i was thinking we can use readonly root's commit root safely while it is not true, readonly root may be cowed with the following cases. 1.snapshot send root will cow source root. 2.balance,device operations will also cow readonly send root to relocate. So i have two ideas to make us safe to use commit root. -->approach 1: make it protected by transaction and end transaction properly and we research next item from root node(see btrfs_search_slot_for_read()). -->approach 2: add another counter to local root structure to sync snapshot with send. and add a global counter to sync send with exclusive device operations. So with approach 2, send can use commit root safely, because we make sure send root can not be cowed during send. Unfortunately, it make codes ugly and more complex to maintain. To make snapshot and send exclusively, device operations and send operation exclusively with each other is a little confusing for common users. So why not drop into previous way. Cc: Josef Bacik <jbacik@fb.com> Signed-off-by: Wang Shilong <wangsl.fnst@cn.fujitsu.com> Signed-off-by: Josef Bacik <jbacik@fb.com>	2014-03-10 15:16:35 -04:00
Wang Shilong	bcbba5e659	Btrfs: skip readonly root for snapshot-aware defragment Btrfs send is assuming readonly root won't change, let's skip readonly root. Signed-off-by: Wang Shilong <wangsl.fnst@cn.fujitsu.com> Signed-off-by: Josef Bacik <jbacik@fb.com>	2014-03-10 15:16:34 -04:00
Wang Shilong	850a8cdffe	Btrfs: switch to btrfs_previous_extent_item() Since we have introduced btrfs_previous_extent_item() to search previous extent item, just switch into it. Signed-off-by: Wang Shilong <wangsl.fnst@cn.fujitsu.com> Reviewed-by: Filipe Manana <fdmanana@gmail.com> Signed-off-by: Josef Bacik <jbacik@fb.com>	2014-03-10 15:15:54 -04:00
Hidetoshi Seto	f88ba6a2a4	Btrfs: skip submitting barrier for missing device I got an error on v3.13: BTRFS error (device sdf1) in write_all_supers:3378: errno=-5 IO failure (errors while submitting device barriers.) how to reproduce: > mkfs.btrfs -f -d raid1 /dev/sdf1 /dev/sdf2 > wipefs -a /dev/sdf2 > mount -o degraded /dev/sdf1 /mnt > btrfs balance start -f -sconvert=single -mconvert=single -dconvert=single /mnt The reason of the error is that barrier_all_devices() failed to submit barrier to the missing device. However it is clear that we cannot do anything on missing device, and also it is not necessary to care chunks on the missing device. This patch stops sending/waiting barrier if device is missing. Signed-off-by: Hidetoshi Seto <seto.hidetoshi@jp.fujitsu.com> Cc: <stable@vger.kernel.org> Signed-off-by: Josef Bacik <jbacik@fb.com>	2014-03-10 15:15:53 -04:00
Josef Bacik	29bce2f399	Btrfs: unlock extent and pages on error in cow_file_range When I converted the BUG_ON() for the free_space_cache_inode in cow_file_range I made it so we just return an error instead of unlocking all of our various stuff. This is a mistake and causes us to hang when we run into this. This patch fixes this problem. Thanks, Signed-off-by: Josef Bacik <jbacik@fb.com>	2014-03-10 15:15:53 -04:00
Josef Bacik	c581afc8db	Btrfs: balance delayed inode updates While trying to reproduce a delayed ref problem I noticed the box kept falling over using all 80gb of my ram with btrfs_inode's and btrfs_delayed_node's. Turns out this is because we only throttle delayed inode updates in btrfs_dirty_inode, which doesn't actually get called that often, especially when all you are doing is creating a bunch of files. So balance delayed inode updates everytime we create a new inode. With this patch we no longer use up all of our ram with delayed inode updates. Thanks, Signed-off-by: Josef Bacik <jbacik@fb.com>	2014-03-10 15:15:52 -04:00
David Sterba	1bae30982b	btrfs: add simple debugfs interface Help during debugging to export various interesting infromation and tunables without the need of extra mount options or ioctls. Usage: * declare your variable in sysfs.h, and include where you need it * define the variable in sysfs.c and make it visible via debugfs_create_TYPE Depends on CONFIG_DEBUG_FS. Signed-off-by: David Sterba <dsterba@suse.cz> Signed-off-by: Josef Bacik <jbacik@fb.com>	2014-03-10 15:15:51 -04:00
David Sterba	ace0105076	btrfs: send: lower memory requirements in common case The fs_path structure uses an inline buffer and falls back to a chain of allocations, but vmalloc is not necessary because PATH_MAX fits into PAGE_SIZE. The size of fs_path has been reduced to 256 bytes from PAGE_SIZE, usually 4k. Experimental measurements show that most paths on a single filesystem do not exceed 200 bytes, and these get stored into the inline buffer directly, which is now 230 bytes. Longer paths are kmalloced when needed. Signed-off-by: David Sterba <dsterba@suse.cz> Signed-off-by: Josef Bacik <jbacik@fb.com>	2014-03-10 15:15:50 -04:00
Filipe David Borba Manana	dff6d0adbe	Btrfs: make some tree searches in send.c more efficient We have this pattern where we do search for a contiguous group of items in a tree and everytime we find an item, we process it, then we release our path, increment the offset of the search key, do another full tree search and repeat these steps until a tree search can't find more items we're interested in. Instead of doing these full tree searches after processing each item, just process the next item/slot in our leaf and don't release the path. Since all these trees are read only and we always use the commit root for a search and skip node/leaf locks, we're not affecting concurrency on the trees. Signed-off-by: Filipe David Borba Manana <fdmanana@gmail.com> Signed-off-by: Josef Bacik <jbacik@fb.com>	2014-03-10 15:15:49 -04:00
Filipe David Borba Manana	a0859c0998	Btrfs: use right extent item position in send when finding extent clones This was a leftover from the commit: `74dd17fbe3` (Btrfs: fix btrfs send for inline items and compression) Signed-off-by: Filipe David Borba Manana <fdmanana@gmail.com> Signed-off-by: Josef Bacik <jbacik@fb.com>	2014-03-10 15:15:48 -04:00
David Sterba	57fb8910c2	btrfs: send: remove BUG_ON from name_cache_delete If cleaning the name cache fails, we could try to proceed at the cost of some memory leak. This is not expected to happen often. Signed-off-by: David Sterba <dsterba@suse.cz> Signed-off-by: Josef Bacik <jbacik@fb.com>	2014-03-10 15:15:48 -04:00
David Sterba	4d1a63b21b	btrfs: send: remove BUG from process_all_refs There are only 2 static callers, the BUG would normally be never reached, but let's be nice. Signed-off-by: David Sterba <dsterba@suse.cz> Signed-off-by: Josef Bacik <jbacik@fb.com>	2014-03-10 15:15:47 -04:00
David Sterba	1f5a7ff999	btrfs: send: squeeze bitfilelds in fs_path We know that buf_len is at most PATH_MAX, 4k, and can merge it with the reversed member. This saves 3 bytes in favor of inline_buf. Signed-off-by: David Sterba <dsterba@suse.cz> Signed-off-by: Josef Bacik <jbacik@fb.com>	2014-03-10 15:15:46 -04:00
David Sterba	e25a812206	btrfs: send: remove virtual_mem member from fs_path We don't need to keep track of that, it's available via is_vmalloc_addr. Signed-off-by: David Sterba <dsterba@suse.cz> Signed-off-by: Josef Bacik <jbacik@fb.com>	2014-03-10 15:15:45 -04:00
David Sterba	b23ab57d48	btrfs: send: remove prepared member from fs_path The member is used only to return value back from fs_path_prepare_for_add, we can do it locally and save 8 bytes for the inline_buf path. Signed-off-by: David Sterba <dsterba@suse.cz> Signed-off-by: Josef Bacik <jbacik@fb.com>	2014-03-10 15:15:44 -04:00
David Sterba	64792f2535	btrfs: send: replace check with an assert in gen_unique_name The buffer passed to snprintf can hold the fully expanded format string, 64 = 3x largest ULL + 3x char + trailing null. I don't think that removing the check entirely is a good idea, hence the ASSERT. Signed-off-by: David Sterba <dsterba@suse.cz> Signed-off-by: Josef Bacik <jbacik@fb.com>	2014-03-10 15:15:44 -04:00
Filipe David Borba Manana	5ed7f9ff15	Btrfs: more send support for parent/child dir relationship inversion The commit titled "Btrfs: fix infinite path build loops in incremental send" didn't cover a particular case where the parent-child relationship inversion of directories doesn't imply a rename of the new parent directory. This was due to a simple logic mistake, a logical and instead of a logical or. Steps to reproduce: $ mkfs.btrfs -f /dev/sdb3 $ mount /dev/sdb3 /mnt/btrfs $ mkdir -p /mnt/btrfs/a/b/bar1/bar2/bar3/bar4 $ btrfs subvol snapshot -r /mnt/btrfs /mnt/btrfs/snap1 $ mv /mnt/btrfs/a/b/bar1/bar2/bar3/bar4 /mnt/btrfs/a/b/k44 $ mv /mnt/btrfs/a/b/bar1/bar2/bar3 /mnt/btrfs/a/b/k44 $ mv /mnt/btrfs/a/b/bar1/bar2 /mnt/btrfs/a/b/k44/bar3 $ mv /mnt/btrfs/a/b/bar1 /mnt/btrfs/a/b/k44/bar3/bar2/k11 $ btrfs subvol snapshot -r /mnt/btrfs /mnt/btrfs/snap2 $ btrfs send -p /mnt/btrfs/snap1 /mnt/btrfs/snap2 > /tmp/incremental.send A patch to update the test btrfs/030 from xfstests, so that it covers this case, will be submitted soon. Signed-off-by: Filipe David Borba Manana <fdmanana@gmail.com> Signed-off-by: Josef Bacik <jbacik@fb.com>	2014-03-10 15:15:43 -04:00
Filipe David Borba Manana	03cb4fb9d8	Btrfs: fix send dealing with file renames and directory moves This fixes a case that the commit titled: Btrfs: fix infinite path build loops in incremental send didn't cover. If the parent-child relationship between 2 directories is inverted, both get renamed, and the former parent has a file that got renamed too (but remains a child of that directory), the incremental send operation would use the file's old path after sending an unlink operation for that old path, causing receive to fail on future operations like changing owner, permissions or utimes of the corresponding inode. This is not a regression from the commit mentioned before, as without that commit we would fall into the issues that commit fixed, so it's just one case that wasn't covered before. Simple steps to reproduce this issue are: $ mkfs.btrfs -f /dev/sdb3 $ mount /dev/sdb3 /mnt/btrfs $ mkdir -p /mnt/btrfs/a/b/c/d $ touch /mnt/btrfs/a/b/c/d/file $ mkdir -p /mnt/btrfs/a/b/x $ btrfs subvol snapshot -r /mnt/btrfs /mnt/btrfs/snap1 $ mv /mnt/btrfs/a/b/x /mnt/btrfs/a/b/c/x2 $ mv /mnt/btrfs/a/b/c/d /mnt/btrfs/a/b/c/x2/d2 $ mv /mnt/btrfs/a/b/c/x2/d2/file /mnt/btrfs/a/b/c/x2/d2/file2 $ btrfs subvol snapshot -r /mnt/btrfs /mnt/btrfs/snap2 $ btrfs send -p /mnt/btrfs/snap1 /mnt/btrfs/snap2 > /tmp/incremental.send A patch to update the test btrfs/030 from xfstests, so that it covers this case, will be submitted soon. Signed-off-by: Filipe David Borba Manana <fdmanana@gmail.com> Signed-off-by: Josef Bacik <jbacik@fb.com>	2014-03-10 15:15:42 -04:00
Wang Shilong	98cfee2143	Btrfs: only add roots if necessary in find_parent_nodes() find_all_leafs() dosen't need add all roots actually, add roots only if we need, this can avoid unnecessary ulist dance. Signed-off-by: Wang Shilong <wangsl.fnst@cn.fujitsu.com> Signed-off-by: Josef Bacik <jbacik@fb.com>	2014-03-10 15:15:41 -04:00
Hugo Mills	abccd00f8a	btrfs: Fix 32/64-bit problem with BTRFS_SET_RECEIVED_SUBVOL ioctl The structure for BTRFS_SET_RECEIVED_IOCTL packs differently on 32-bit and 64-bit systems. This means that it is impossible to use btrfs receive on a system with a 64-bit kernel and 32-bit userspace, because the structure size (and hence the ioctl number) is different. This patch adds a compatibility structure and ioctl to deal with the above case. Signed-off-by: Hugo Mills <hugo@carfax.org.uk> Signed-off-by: Josef Bacik <jbacik@fb.com>	2014-03-10 15:15:40 -04:00
Filipe David Borba Manana	d86477b303	Btrfs: add missing error check in incremental send Function wait_for_parent_move() returns negative value if an error happened, 0 if we don't need to wait for the parent's move, and 1 if the wait is needed. Before this change an error return value was being treated like the return value 1, which was not correct. Signed-off-by: Filipe David Borba Manana <fdmanana@gmail.com> Signed-off-by: Josef Bacik <jbacik@fb.com>	2014-03-10 15:15:40 -04:00
Miao Xie	c404e0dc2c	Btrfs: fix use-after-free in the finishing procedure of the device replace During device replace test, we hit a null pointer deference (It was very easy to reproduce it by running xfstests' btrfs/011 on the devices with the virtio scsi driver). There were two bugs that caused this problem: - We might allocate new chunks on the replaced device after we updated the mapping tree. And we forgot to replace the source device in those mapping of the new chunks. - We might get the mapping information which including the source device before the mapping information update. And then submit the bio which was based on that mapping information after we freed the source device. For the first bug, we can fix it by doing mapping tree update and source device remove in the same context of the chunk mutex. The chunk mutex is used to protect the allocable device list, the above method can avoid the new chunk allocation, and after we remove the source device, all the new chunks will be allocated on the new device. So it can fix the first bug. For the second bug, we need make sure all flighting bios are finished and no new bios are produced during we are removing the source device. To fix this problem, we introduced a global @bio_counter, we not only inc/dec @bio_counter outsize of map_blocks, but also inc it before submitting bio and dec @bio_counter when ending bios. Since Raid56 is a little different and device replace dosen't support raid56 yet, it is not addressed in the patch and I add comments to make sure we will fix it in the future. Reported-by: Qu Wenruo <quwenruo@cn.fujitsu.com> Signed-off-by: Wang Shilong <wangsl.fnst@cn.fujitsu.com> Signed-off-by: Miao Xie <miaox@cn.fujitsu.com> Signed-off-by: Josef Bacik <jbacik@fb.com>	2014-03-10 15:15:39 -04:00
Miao Xie	391cd9df81	Btrfs: fix unprotected alloc list insertion during the finishing procedure of replace the alloc list of the filesystem is protected by ->chunk_mutex, we need get that mutex when we insert the new device into the list. Signed-off-by: Miao Xie <miaox@cn.fujitsu.com> Signed-off-by: Josef Bacik <jbacik@fb.com>	2014-03-10 15:15:38 -04:00
Kusanagi Kouichi	23ad5b17dc	btrfs: Return EXDEV for cross file system snapshot EXDEV seems an appropriate error if an operation fails bacause it crosses file system boundaries. Reviewed-by: David Sterba <dsterba@suse.cz> Signed-off-by: Kusanagi Kouichi <slash@ac.auone-net.jp> Signed-off-by: Josef Bacik <jbacik@fb.com>	2014-03-10 15:15:37 -04:00
Miao Xie	827463c49f	Btrfs: don't mix the ordered extents of all files together during logging the inodes There was a problem in the old code: If we failed to log the csum, we would free all the ordered extents in the log list including those ordered extents that were logged successfully, it would make the log committer not to wait for the completion of the ordered extents. This patch doesn't insert the ordered extents that is about to be logged into a global list, instead, we insert them into a local list. If we log the ordered extents successfully, we splice them with the global list, or we will throw them away, then do full sync. It can also reduce the lock contention and the traverse time of list. Signed-off-by: Miao Xie <miaox@cn.fujitsu.com> Signed-off-by: Josef Bacik <jbacik@fb.com>	2014-03-10 15:15:36 -04:00
Thomas Gleixner	f9a8a0abc3	Merge branch 'fortglx/3.15/time' of git://git.linaro.org/people/john.stultz/linux into timers/core - support CLOCK_BOOTTIME clock in timerfd - Add missing header file Signed-off-by: Thomas Gleixner <tglx@linutronix.de>	2014-03-10 19:53:09 +01:00
Al Viro	bd2a31d522	get rid of fget_light() instead of returning the flags by reference, we can just have the low-level primitive return those in lower bits of unsigned long, with struct file * derived from the rest. Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2014-03-10 11:44:42 -04:00
Linus Torvalds	9c225f2655	vfs: atomic f_pos accesses as per POSIX Our write() system call has always been atomic in the sense that you get the expected thread-safe contiguous write, but we haven't actually guaranteed that concurrent writes are serialized wrt f_pos accesses, so threads (or processes) that share a file descriptor and use "write()" concurrently would quite likely overwrite each others data. This violates POSIX.1-2008/SUSv4 Section XSI 2.9.7 that says: "2.9.7 Thread Interactions with Regular File Operations All of the following functions shall be atomic with respect to each other in the effects specified in POSIX.1-2008 when they operate on regular files or symbolic links: [...]" and one of the effects is the file position update. This unprotected file position behavior is not new behavior, and nobody has ever cared. Until now. Yongzhi Pan reported unexpected behavior to Michael Kerrisk that was due to this. This resolves the issue with a f_pos-specific lock that is taken by read/write/lseek on file descriptors that may be shared across threads or processes. Reported-by: Yongzhi Pan <panyongzhi@gmail.com> Reported-by: Michael Kerrisk <mtk.manpages@gmail.com> Cc: Al Viro <viro@zeniv.linux.org.uk> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2014-03-10 11:44:41 -04:00
Al Viro	1b56e98990	ocfs2 syncs the wrong range... Cc: stable@vger.kernel.org Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2014-03-10 11:43:32 -04:00
Linus Torvalds	fe9ea91cde	NFS client bugfixes for Linux 3.14 Highlights include: - Fix another nfs4_sequence corruptor in RELEASE_LOCKOWNER - Fix an Oopsable delegation callback race - Fix another bad stateid infinite loop - Fail the data server I/O is the stateid represents a lost lock - Fix an Oopsable sunrpc trace event -----BEGIN PGP SIGNATURE----- Version: GnuPG v1 iQIcBAABAgAGBQJTHJSVAAoJEGcL54qWCgDyVRkP/2t43gjMF6P+Yc7VUW2e5uTv rHhPGFLuDVs9oS3WUYegzvThZMs//ovTaYgUSDNpOYztEB6P8bDRm41q/VgUIixY zWFoEplDgAZZE7gP2EJuXJv3bEdhJqXuCG2KUysqMsaIGlahrlQdHmqGTz6Y931o WROyMWVvnL4IoEtQHVR7DwyqkvSmifPJ8MZZv3Liy82wuw1fCsh8uy8mkYYSbdvN OK4JmHqdJ+CbAZ0WmE4Xe3Itqy/aIMBL9Jyrq4Zl1QX0p7ez3Xpy4XwmtlZXn2KP bKMfK2vP9RggagIpjUL+dhCqxlsyjlF6EzTnQRe7jXqlJ/vJ9pQF8X294jwRysfp 80jDqsTSND4JQiZuBISID23N1nL0TzrP2tWqipR9zx5JJMRVzYZWTzEq4w2uAHgg aW2vTdRNRLZWydlfFNQ8FiuEPIFoQaJFmOCQisec2LtfffLZZBz7JPofjNH9CgU8 mcbPhv75m2imXDOylydiVoD4x/myCGheYw2hpqhb1ZeuQxdN9lnwa0JzjPiP1h38 XIYwzM7TE8WayrdkMDCeIem1dz/VexknfKmXmFXlMfn3GRKxowCSrggxKG92k0eP L35cJj91a9AoxMz/ej0erv0iI1flLeoYP9aJzIRtZf+SB1BZkKhmWlFRQKqnlIOA BzjYui4mUoEQEa5Sk7Th =JfQx -----END PGP SIGNATURE----- Merge tag 'nfs-for-3.14-5' of git://git.linux-nfs.org/projects/trondmy/linux-nfs Pull NFS client bugfixes from Trond Myklebust: "Highlights include: - Fix another nfs4_sequence corruptor in RELEASE_LOCKOWNER - Fix an Oopsable delegation callback race - Fix another bad stateid infinite loop - Fail the data server I/O is the stateid represents a lost lock - Fix an Oopsable sunrpc trace event" * tag 'nfs-for-3.14-5' of git://git.linux-nfs.org/projects/trondmy/linux-nfs: SUNRPC: Fix oops when trace sunrpc_task events in nfs client NFSv4: Fail the truncate() if the lock/open stateid is invalid NFSv4.1 Fail data server I/O if stateid represents a lost lock NFSv4: Fix the return value of nfs4_select_rw_stateid NFSv4: nfs4_stateid_is_current should return 'true' for an invalid stateid NFS: Fix a delegation callback race NFSv4: Fix another nfs4_sequence corruptor	2014-03-09 19:17:39 -07:00
Tejun Heo	b7ce40cff0	kernfs: cache atomic_write_len in kernfs_open_file While implementing atomic_write_len, `4d3773c4bb` ("kernfs: implement kernfs_ops->atomic_write_len") moved data copy from userland inside kernfs_get_active() and kernfs_open_file->mutex so that kernfs_ops->atomic_write_len can be accessed before copying buffer from userland; unfortunately, this could lead to locking order inversion involving mmap_sem if copy_from_user() takes a page fault. ====================================================== [ INFO: possible circular locking dependency detected ] 3.14.0-rc4-next-20140228-sasha-00011-g4077c67-dirty #26 Tainted: G W ------------------------------------------------------- trinity-c236/10658 is trying to acquire lock: (&of->mutex#2){+.+.+.}, at: [<fs/kernfs/file.c:487>] kernfs_fop_mmap+0x54/0x120 but task is already holding lock: (&mm->mmap_sem){++++++}, at: [<mm/util.c:397>] vm_mmap_pgoff+0x6e/0xe0 which lock already depends on the new lock. the existing dependency chain (in reverse order) is: -> #1 (&mm->mmap_sem){++++++}: [<kernel/locking/lockdep.c:1945 kernel/locking/lockdep.c:2131>] validate_chain+0x6c5/0x7b0 [<kernel/locking/lockdep.c:3182>] __lock_acquire+0x4cd/0x5a0 [<arch/x86/include/asm/current.h:14 kernel/locking/lockdep.c:3602>] lock_acquire+0x182/0x1d0 [<mm/memory.c:4188>] might_fault+0x7e/0xb0 [<arch/x86/include/asm/uaccess.h:713 fs/kernfs/file.c:291>] kernfs_fop_write+0xd8/0x190 [<fs/read_write.c:473>] vfs_write+0xe3/0x1d0 [<fs/read_write.c:523 fs/read_write.c:515>] SyS_write+0x5d/0xa0 [<arch/x86/kernel/entry_64.S:749>] tracesys+0xdd/0xe2 -> #0 (&of->mutex#2){+.+.+.}: [<kernel/locking/lockdep.c:1840>] check_prev_add+0x13f/0x560 [<kernel/locking/lockdep.c:1945 kernel/locking/lockdep.c:2131>] validate_chain+0x6c5/0x7b0 [<kernel/locking/lockdep.c:3182>] __lock_acquire+0x4cd/0x5a0 [<arch/x86/include/asm/current.h:14 kernel/locking/lockdep.c:3602>] lock_acquire+0x182/0x1d0 [<kernel/locking/mutex.c:470 kernel/locking/mutex.c:571>] mutex_lock_nested+0x6a/0x510 [<fs/kernfs/file.c:487>] kernfs_fop_mmap+0x54/0x120 [<mm/mmap.c:1573>] mmap_region+0x310/0x5c0 [<mm/mmap.c:1365>] do_mmap_pgoff+0x385/0x430 [<mm/util.c:399>] vm_mmap_pgoff+0x8f/0xe0 [<mm/mmap.c:1416 mm/mmap.c:1374>] SyS_mmap_pgoff+0x1b0/0x210 [<arch/x86/kernel/sys_x86_64.c:72>] SyS_mmap+0x1d/0x20 [<arch/x86/kernel/entry_64.S:749>] tracesys+0xdd/0xe2 other info that might help us debug this: Possible unsafe locking scenario: CPU0 CPU1 ---- ---- lock(&mm->mmap_sem); lock(&of->mutex#2); lock(&mm->mmap_sem); lock(&of->mutex#2); * DEADLOCK * 1 lock held by trinity-c236/10658: #0: (&mm->mmap_sem){++++++}, at: [<mm/util.c:397>] vm_mmap_pgoff+0x6e/0xe0 stack backtrace: CPU: 2 PID: 10658 Comm: trinity-c236 Tainted: G W 3.14.0-rc4-next-20140228-sasha-00011-g4077c67-dirty #26 0000000000000000 ffff88011911fa48 ffffffff8438e945 0000000000000000 0000000000000000 ffff88011911fa98 ffffffff811a0109 ffff88011911fab8 ffff88011911fab8 ffff88011911fa98 ffff880119128cc0 ffff880119128cf8 Call Trace: [<lib/dump_stack.c:52>] dump_stack+0x52/0x7f [<kernel/locking/lockdep.c:1213>] print_circular_bug+0x129/0x160 [<kernel/locking/lockdep.c:1840>] check_prev_add+0x13f/0x560 [<include/linux/spinlock.h:343 mm/slub.c:1933>] ? deactivate_slab+0x511/0x550 [<kernel/locking/lockdep.c:1945 kernel/locking/lockdep.c:2131>] validate_chain+0x6c5/0x7b0 [<kernel/locking/lockdep.c:3182>] __lock_acquire+0x4cd/0x5a0 [<mm/mmap.c:1552>] ? mmap_region+0x24a/0x5c0 [<arch/x86/include/asm/current.h:14 kernel/locking/lockdep.c:3602>] lock_acquire+0x182/0x1d0 [<fs/kernfs/file.c:487>] ? kernfs_fop_mmap+0x54/0x120 [<kernel/locking/mutex.c:470 kernel/locking/mutex.c:571>] mutex_lock_nested+0x6a/0x510 [<fs/kernfs/file.c:487>] ? kernfs_fop_mmap+0x54/0x120 [<kernel/sched/core.c:2477>] ? get_parent_ip+0x11/0x50 [<fs/kernfs/file.c:487>] ? kernfs_fop_mmap+0x54/0x120 [<fs/kernfs/file.c:487>] kernfs_fop_mmap+0x54/0x120 [<mm/mmap.c:1573>] mmap_region+0x310/0x5c0 [<mm/mmap.c:1365>] do_mmap_pgoff+0x385/0x430 [<mm/util.c:397>] ? vm_mmap_pgoff+0x6e/0xe0 [<mm/util.c:399>] vm_mmap_pgoff+0x8f/0xe0 [<kernel/rcu/update.c:97>] ? __rcu_read_unlock+0x44/0xb0 [<fs/file.c:641>] ? dup_fd+0x3c0/0x3c0 [<mm/mmap.c:1416 mm/mmap.c:1374>] SyS_mmap_pgoff+0x1b0/0x210 [<arch/x86/kernel/sys_x86_64.c:72>] SyS_mmap+0x1d/0x20 [<arch/x86/kernel/entry_64.S:749>] tracesys+0xdd/0xe2 Fix it by caching atomic_write_len in kernfs_open_file during open so that it can be determined without accessing kernfs_ops in kernfs_fop_write(). This restores the structure of kernfs_fop_write() before `4d3773c4bb` with updated @len determination logic. Signed-off-by: Tejun Heo <tj@kernel.org> Reported-by: Sasha Levin <sasha.levin@oracle.com> References: http://lkml.kernel.org/g/53113485.2090407@oracle.com Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>	2014-03-08 22:08:29 -08:00
Richard Cochran	88391d49ab	kernfs: fix off by one error. The hash values 0 and 1 are reserved for magic directory entries, but the code only prevents names hashing to 0. This patch fixes the test to also prevent hash value 1. Signed-off-by: Richard Cochran <richardcochran@gmail.com> Cc: <stable@vger.kernel.org> Reviewed-by: "Eric W. Biederman" <ebiederm@xmission.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>	2014-03-08 22:08:29 -08:00
Theodore Ts'o	0bfea8118d	jbd2: minimize region locked by j_list_lock in jbd2_journal_forget() It's not needed until we start trying to modifying fields in the journal_head which are protected by j_list_lock. Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>	2014-03-09 00:56:58 -05:00
Theodore Ts'o	6e4862a5bb	jbd2: minimize region locked by j_list_lock in journal_get_create_access() It's not needed until we start trying to modifying fields in the journal_head which are protected by j_list_lock. Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>	2014-03-09 00:46:23 -05:00
Theodore Ts'o	d2eb0b9989	jbd2: check jh->b_transaction without taking j_list_lock jh->b_transaction is adequately protected for reading by the jbd_lock_bh_state(bh), so we don't need to take j_list_lock in __journal_try_to_free_buffer(). Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>	2014-03-09 00:07:19 -05:00
Theodore Ts'o	d4e839d4a9	jbd2: add transaction to checkpoint list earlier We don't otherwise need j_list_lock during the rest of commit phase #7, so add the transaction to the checkpoint list at the very end of commit phase #6. This allows us to drop j_list_lock earlier, which is a good thing since it is a super hot lock. Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>	2014-03-08 22:34:10 -05:00
Theodore Ts'o	42cf3452d5	jbd2: calculate statistics without holding j_state_lock and j_list_lock The two hottest locks, and thus the biggest scalability bottlenecks, in the jbd2 layer, are the j_list_lock and j_state_lock. This has inspired some people to do some truly unnatural things[1]. [1] https://www.usenix.org/system/files/conference/fast14/fast14-paper_kang.pdf We don't need to be holding both j_state_lock and j_list_lock while calculating the journal statistics, so move those calculations to the very end of jbd2_journal_commit_transaction. Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>	2014-03-08 19:51:16 -05:00
Theodore Ts'o	3469a32a1e	jbd2: don't hold j_state_lock while calling wake_up() The j_state_lock is one of the hottest locks in the jbd2 layer and thus one of its scalability bottlenecks. We don't need to be holding the j_state_lock while we are calling wake_up(&journal->j_wait_commit), so release the lock a little bit earlier. Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>	2014-03-08 19:11:36 -05:00
Theodore Ts'o	df3c1e9a05	jbd2: don't unplug after writing revoke records During commit process, keep the block device plugged after we are done writing the revoke records, until we are finished writing the rest of the commit records in the journal. This will allow most of the journal blocks to be written in a single I/O operation, instead of separating the the revoke blocks from the rest of the journal blocks. Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>	2014-03-08 18:13:52 -05:00
Linus Torvalds	2a75184d52	Merge branch 'for-linus' of git://git.kernel.dk/linux-block Pull block fixes from Jens Axboe: "Small collection of fixes for 3.14-rc. It contains: - Three minor update to blk-mq from Christoph. - Reduce number of unaligned (< 4kb) in-flight writes on mtip32xx to two. From Micron. - Make the blk-mq CPU notify spinlock raw, since it can't be a sleeper spinlock on RT. From Mike Galbraith. - Drop now bogus BUG_ON() for bio iteration with blk integrity. From Nic Bellinger. - Properly propagate the SYNC flag on requests. From Shaohua" * 'for-linus' of git://git.kernel.dk/linux-block: blk-mq: add REQ_SYNC early rt,blk,mq: Make blk_mq_cpu_notify_lock a raw spinlock bio-integrity: Drop bio_integrity_verify BUG_ON in post bip->bip_iter world blk-mq: support partial I/O completions blk-mq: merge blk_mq_insert_request and blk_mq_run_request blk-mq: remove blk_mq_alloc_rq mtip32xx: Reduce the number of unaligned writes to 2	2014-03-07 09:59:44 -08:00
Tejun Heo	059499453a	afs: don't use PREPARE_WORK PREPARE_[DELAYED_]WORK() are being phased out. They have few users and a nasty surprise in terms of reentrancy guarantee as workqueue considers work items to be different if they don't have the same work function. afs_call->async_work is multiplexed with multiple work functions. Introduce afs_async_workfn() which invokes afs_call->async_workfn and always use it as the work function and update the users to set the ->async_workfn field instead of overriding the work function using PREPARE_WORK(). It would probably be best to route this with other related updates through the workqueue tree. Compile tested. Signed-off-by: Tejun Heo <tj@kernel.org> Cc: David Howells <dhowells@redhat.com> Cc: linux-afs@lists.infradead.org	2014-03-07 10:24:50 -05:00
Joe Perches	cb94eb066e	GFS2: Convert gfs2_lm_withdraw to use fs_err vprintk use is not prefixed by a KERN_<LEVEL>, so emit these messages at KERN_ERR level. Using %pV can save some code and allow fs_err to be used, so do it. Signed-off-by: Joe Perches <joe@perches.com> Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>	2014-03-07 09:39:18 +00:00
Joe Perches	8382e26b2c	GFS2: Use fs_<level> more often Convert a couple of uses of pr_<level> to fs_<level> Add and use fs_emerg. Signed-off-by: Joe Perches <joe@perches.com> Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>	2014-03-07 09:32:31 +00:00
Joe Perches	d77d1b58aa	GFS2: Use pr_<level> more consistently Add pr_fmt, remove embedded "GFS2: " prefixes. This now consistently emits lower case "gfs2: " for each message. Other miscellanea around these changes: o Add missing newlines o Coalesce formats o Realign arguments Signed-off-by: Joe Perches <joe@perches.com> Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>	2014-03-07 09:30:51 +00:00
Bob Peterson	a17d758b66	GFS2: Move recovery variables to journal structure in memory If multiple nodes fail and their recovery work runs simultaneously, they would use the same unprotected variables in the superblock. For example, they would stomp on each other's revoked blocks lists, which resulted in file system metadata corruption. This patch moves the necessary variables so that each journal has its own separate area for tracking its journal replay. Signed-off-by: Bob Peterson <rpeterso@redhat.com> Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>	2014-03-07 09:14:48 +00:00
Dave Chinner	fe4c224aa1	xfs: inode log reservations are still too small Back in commit `23956703` ("xfs: inode log reservations are too small"), the reservation size was increased to take into account the difference in size between the in-memory BMBT block headers and the on-disk BMDR headers. This solved a transaction overrun when logging the inode size. Recently, however, we've seen a number of these same overruns on kernels with the above fix in it. All of them have been by 4 bytes, so we must still not be accounting for something correctly. Through inspection it turns out the above commit didn't take into account everything it should have. That is, it only accounts for a single log op_hdr structure, when it can actually require up to four op_hdrs - one for each region (log iovec) that is formatted. These regions are the inode log format header, the inode core, and the two forks that can be held in the literal area of the inode. This means we are not accounting for 36 bytes of log space that the transaction can use, and hence when we get inodes in certain formats with particular fragmentation patterns we can overrun the transaction. Fix this by adding the correct accounting for log op_headers in the transaction. Tested-by: Brian Foster <bfoster@redhat.com> Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Eric Sandeen <sandeen@redhat.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Dave Chinner <david@fromorbit.com>	2014-03-07 16:19:14 +11:00
Dave Chinner	a49935f200	xfs: xfs_check_page_type buffer checks need help xfs_aops_discard_page() was introduced in the following commit: xfs: truncate delalloc extents when IO fails in writeback ... to clean up left over delalloc ranges after I/O failure in ->writepage(). generic/224 tests for this scenario and occasionally reproduces panics on sub-4k blocksize filesystems. The cause of this is failure to clean up the delalloc range on a page where the first buffer does not match one of the expected states of xfs_check_page_type(). If a buffer is not unwritten, delayed or dirty&mapped, xfs_check_page_type() stops and immediately returns 0. The stress test of generic/224 creates a scenario where the first several buffers of a page with delayed buffers are mapped & uptodate and some subsequent buffer is delayed. If the ->writepage() happens to fail for this page, xfs_aops_discard_page() incorrectly skips the entire page. This then causes later failures either when direct IO maps the range and finds the stale delayed buffer, or we evict the inode and find that the inode still has a delayed block reservation accounted to it. We can easily fix this xfs_aops_discard_page() failure by making xfs_check_page_type() check all buffers, but this breaks xfs_convert_page() more than it is already broken. Indeed, xfs_convert_page() wants xfs_check_page_type() to tell it if the first buffers on the pages are of a type that can be aggregated into the contiguous IO that is already being built. xfs_convert_page() should not be writing random buffers out of a page, but the current behaviour will cause it to do so if there are buffers that don't match the current specification on the page. Hence for xfs_convert_page() we need to: a) return "not ok" if the first buffer on the page does not match the specification provided to we don't write anything; and b) abort it's buffer-add-to-io loop the moment we come across a buffer that does not match the specification. Hence we need to fix both xfs_check_page_type() and xfs_convert_page() to work correctly with pages that have mixed buffer types, whilst allowing xfs_aops_discard_page() to scan all buffers on the page for a type match. Reported-by: Brian Foster <bfoster@redhat.com> Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Dave Chinner <david@fromorbit.com>	2014-03-07 16:19:14 +11:00

... 5 6 7 8 9 ...