Commit graph

42873 commits

Author SHA1 Message Date
Al Viro
2b11d80e1a fix d_walk()/non-delayed __d_free() race
commit 3d56c25e3bb0726a5c5e16fc2d9e38f8ed763085 upstream.

Ascend-to-parent logics in d_walk() depends on all encountered child
dentries not getting freed without an RCU delay.  Unfortunately, in
quite a few cases it is not true, with hard-to-hit oopsable race as
the result.

Fortunately, the fix is simiple; right now the rule is "if it ever
been hashed, freeing must be delayed" and changing it to "if it
ever had a parent, freeing must be delayed" closes that hole and
covers all cases the old rule used to cover.  Moreover, pipes and
sockets remain _not_ covered, so we do not introduce RCU delay in
the cases which are the reason for having that delay conditional
in the first place.

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2016-06-24 10:18:21 -07:00
Jann Horn
9beb96b344 proc: prevent stacking filesystems on top
commit e54ad7f1ee263ffa5a2de9c609d58dfa27b21cd9 upstream.

This prevents stacking filesystems (ecryptfs and overlayfs) from using
procfs as lower filesystem.  There is too much magic going on inside
procfs, and there is no good reason to stack stuff on top of procfs.

(For example, procfs does access checks in VFS open handlers, and
ecryptfs by design calls open handlers from a kernel thread that doesn't
drop privileges or so.)

Signed-off-by: Jann Horn <jannh@google.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2016-06-24 10:18:20 -07:00
Jann Horn
dea2cf7c0c ecryptfs: forbid opening files without mmap handler
commit 2f36db71009304b3f0b95afacd8eba1f9f046b87 upstream.

This prevents users from triggering a stack overflow through a recursive
invocation of pagefault handling that involves mapping procfs files into
virtual memory.

Signed-off-by: Jann Horn <jannh@google.com>
Acked-by: Tyler Hicks <tyhicks@canonical.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2016-06-24 10:18:20 -07:00
Minchan Kim
384da356db mm: Support address range reclaim
This patch adds address range reclaim of a process.
The requirement is following as,

Like webkit1, it uses a address space for handling multi tabs.
IOW, it uses *one* process model so all tabs shares address space
of the process. In such scenario, per-process reclaim is rather
coarse-grained so this patch supports more fine-grained reclaim
for being able to reclaim target address range of the process.
For reclaim target range, you should use following format.

	echo [addr] [size-byte] > /proc/pid/reclaim
The addr should be page-aligned.

So now reclaim konb's interface is following as.

echo file > /proc/pid/reclaim
	reclaim file-backed pages only
echo anon > /proc/pid/reclaim
	reclaim anonymous pages only
echo all > /proc/pid/reclaim
	reclaim all pages
echo 0x100000 8K > /proc/pid/reclaim
	reclaim pages in (0x100000 - 0x102000)

Change-Id: I111131d31be1cfcfa246617b634a9a8bc4078098
Signed-off-by: Minchan Kim <minchan@kernel.org>
Patch-mainline: linux-mm @ 9 May 2013 08:39:01
[vinmenon@codeaurora.org: trivial merge conflict fixes]
Signed-off-by: Vinayak Menon <vinmenon@codeaurora.org>
2016-06-22 14:44:20 -07:00
Minchan Kim
06de050ac6 mm: Enhance per process reclaim to consider shared pages
Some pages could be shared by several processes. (ex, libc)
In case of that, it's too bad to reclaim them from the beginnig.

This patch causes VM to keep them on memory until last task
try to reclaim them so shared pages will be reclaimed only if
all of task has gone swapping out.

This feature doesn't handle non-linear mapping on ramfs because
it's very time-consuming and doesn't make sure of reclaiming and
not common.

Change-Id: I7e5f34f2e947f5db6d405867fe2ad34863ca40f7
Signed-off-by: Sangseok Lee <sangseok.lee@lge.com>
Signed-off-by: Minchan Kim <minchan@kernel.org>
Patch-mainline: linux-mm @ 9 May 2013 16:21:27
[vinmenon@codeaurora.org: trivial merge conflict fixes + changes
to make the patch work with 3.18 kernel]
Signed-off-by: Vinayak Menon <vinmenon@codeaurora.org>
2016-06-22 14:43:57 -07:00
Minchan Kim
ecc070a294 mm: Per process reclaim
These day, there are many platforms avaiable in the embedded market
and they are smarter than kernel which has very limited information
about working set so they want to involve memory management more heavily
like android's lowmemory killer and ashmem or recent many lowmemory
notifier.

One of the simple imagine scenario about userspace's intelligence is that
platform can manage tasks as forground and backgroud so it would be
better to reclaim background's task pages for end-user's *responsibility*
although it has frequent referenced pages.

This patch adds new knob "reclaim under proc/<pid>/" so task manager
can reclaim any target process anytime, anywhere. It could give another
method to platform for using memory efficiently.

It can avoid process killing for getting free memory, which was really
terrible experience because I lost my best score of game I had ever
after I switch the phone call while I enjoyed the game.

Reclaim file-backed pages only.
	echo file > /proc/PID/reclaim
Reclaim anonymous pages only.
	echo anon > /proc/PID/reclaim
Reclaim all pages
	echo all > /proc/PID/reclaim

Change-Id: Iabdb7bc2ef3dc4d94e3ea005fbe18f4cd06739ab
Signed-off-by: Minchan Kim <minchan@kernel.org>
Patch-mainline: linux-mm @ 9 May 2013 16:21:24
[vinmenon@codeaurora.org: trivial merge conflict fixes,
and minor tweak of the commit msg]
Signed-off-by: Vinayak Menon <vinmenon@codeaurora.org>
2016-06-22 14:43:06 -07:00
Srinivas Ramana
5b07573627 security: pfe: Fix the qualifier used to print size_t
Use the correct type qualifier to print size_t
and ssize_t. This will fix the compilation errors when
compiling for ARM. While at it, fix the compilation errors
in pfk_kc.c for sched functions by including sched.h.

Change-Id: I4fac4530dd4b31baf62ef3719535fd662dc2ae37
Signed-off-by: Srinivas Ramana <sramana@codeaurora.org>
2016-06-22 14:42:18 -07:00
Nikhilesh Reddy
ddd6e3c830 fs:fuse: Disable passthrough when mmap is called on a file
When some data is written to a file both mmap and regular io
there can be race conditions that can cause incorrect data
to be saved.

Disable passthrough on the specific files on which  mmap is called
until we add mmap support to passthrough.

Change-Id: Ic24219ab22d3130aa7e9e998a9e6798648a7321c
Signed-off-by: Nikhilesh Reddy <reddyn@codeaurora.org>
2016-06-21 15:11:43 -07:00
Alex Shi
9ad8208bd7 Merge branch 'linux-linaro-lsk-v4.4' into linux-linaro-lsk-v4.4-android 2016-06-14 17:08:03 +08:00
Dave Chinner
55f6ddfcee xfs: handle dquot buffer readahead in log recovery correctly
commit 7d6a13f023567d573ac362502bb702eda716e654 upstream.

When we do dquot readahead in log recovery, we do not use a verifier
as the underlying buffer may not have dquots in it. e.g. the
allocation operation hasn't yet been replayed. Hence we do not want
to fail recovery because we detect an operation to be replayed has
not been run yet. This problem was addressed for inodes in commit
d891400 ("xfs: inode buffers may not be valid during recovery
readahead") but the problem was not recognised to exist for dquots
and their buffers as the dquot readahead did not have a verifier.

The result of not using a verifier is that when the buffer is then
next read to replay a dquot modification, the dquot buffer verifier
will only be attached to the buffer if *readahead is not complete*.
Hence we can read the buffer, replay the dquot changes and then add
it to the delwri submission list without it having a verifier
attached to it. This then generates warnings in xfs_buf_ioapply(),
which catches and warns about this case.

Fix this and make it handle the same readahead verifier error cases
as for inode buffers by adding a new readahead verifier that has a
write operation as well as a read operation that marks the buffer as
not done if any corruption is detected.  Also make sure we don't run
readahead if the dquot buffer has been marked as cancelled by
recovery.

This will result in readahead either succeeding and the buffer
having a valid write verifier, or readahead failing and the buffer
state requiring the subsequent read to resubmit the IO with the new
verifier.  In either case, this will result in the buffer always
ending up with a valid write verifier on it.

Note: we also need to fix the inode buffer readahead error handling
to mark the buffer with EIO. Brian noticed the code I copied from
there wrong during review, so fix it at the same time. Add comments
linking the two functions that handle readahead verifier errors
together so we don't forget this behavioural link in future.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2016-06-07 18:14:38 -07:00
Eric Sandeen
063b0dc8b4 xfs: print name of verifier if it fails
commit 233135b763db7c64d07b728a9c66745fb0376275 upstream.

This adds a name to each buf_ops structure, so that if
a verifier fails we can print the type of verifier that
failed it.  Should be a slight debugging aid, I hope.

Signed-off-by: Eric Sandeen <sandeen@redhat.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
Cc: Holger Hoffstätte <holger@applied-asynchrony.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2016-06-07 18:14:38 -07:00
Dave Chinner
21cfd6cc64 xfs: skip stale inodes in xfs_iflush_cluster
commit 7d3aa7fe970791f1a674b14572a411accf2f4d4e upstream.

We don't write back stale inodes so we should skip them in
xfs_iflush_cluster, too.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Dave Chinner <david@fromorbit.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2016-06-07 18:14:38 -07:00
Dave Chinner
baa7a74d6d xfs: fix inode validity check in xfs_iflush_cluster
commit 51b07f30a71c27405259a0248206ed4e22adbee2 upstream.

Some careless idiot(*) wrote crap code in commit 1a3e8f3 ("xfs:
convert inode cache lookups to use RCU locking") back in late 2010,
and so xfs_iflush_cluster checks the wrong inode for whether it is
still valid under RCU protection. Fix it to lock and check the
correct inode.

(*) Careless-idiot: Dave Chinner <dchinner@redhat.com>

Discovered-by: Brain Foster <bfoster@redhat.com>
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Dave Chinner <david@fromorbit.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2016-06-07 18:14:38 -07:00
Dave Chinner
7dc8f21bd5 xfs: xfs_iflush_cluster fails to abort on error
commit b1438f477934f5a4d5a44df26f3079a7575d5946 upstream.

When a failure due to an inode buffer occurs, the error handling
fails to abort the inode writeback correctly. This can result in the
inode being reclaimed whilst still in the AIL, leading to
use-after-free situations as well as filesystems that cannot be
unmounted as the inode log items left in the AIL never get removed.

Fix this by ensuring fatal errors from xfs_imap_to_bp() result in
the inode flush being aborted correctly.

Reported-by: Shyam Kaushik <shyam@zadarastorage.com>
Diagnosed-by: Shyam Kaushik <shyam@zadarastorage.com>
Tested-by: Shyam Kaushik <shyam@zadarastorage.com>
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Dave Chinner <david@fromorbit.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2016-06-07 18:14:38 -07:00
Dave Chinner
d7d92ca7dd xfs: Don't wrap growfs AGFL indexes
commit ad747e3b299671e1a53db74963cc6c5f6cdb9f6d upstream.

Commit 96f859d ("libxfs: pack the agfl header structure so
XFS_AGFL_SIZE is correct") allowed the freelist to use the empty
slot at the end of the freelist on 64 bit systems that was not
being used due to sizeof() rounding up the structure size.

This has caused versions of xfs_repair prior to 4.5.0 (which also
has the fix) to report this as a corruption once the filesystem has
been grown. Older kernels can also have problems (seen from a whacky
container/vm management environment) mounting filesystems grown on a
system with a newer kernel than the vm/container it is deployed on.

To avoid this problem, change the initial free list indexes not to
wrap across the end of the AGFL, hence avoiding the initialisation
of agf_fllast to the last index in the AGFL.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Carlos Maiolino <cmaiolino@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2016-06-07 18:14:38 -07:00
Eric Sandeen
ec86bfec23 xfs: disallow rw remount on fs with unknown ro-compat features
commit d0a58e833931234c44e515b5b8bede32bd4e6eed upstream.

Today, a kernel which refuses to mount a filesystem read-write
due to unknown ro-compat features can still transition to read-write
via the remount path.  The old kernel is most likely none the wiser,
because it's unaware of the new feature, and isn't using it.  However,
writing to the filesystem may well corrupt metadata related to that
new feature, and moving to a newer kernel which understand the feature
will have problems.

Right now the only ro-compat feature we have is the free inode btree,
which showed up in v3.16.  It would be good to push this back to
all the active stable kernels, I think, so that if anyone is using
newer mkfs (which enables the finobt feature) with older kernel
releases, they'll be protected.

Signed-off-by: Eric Sandeen <sandeen@redhat.com>
Reviewed-by: Bill O'Donnell <billodo@redhat.com>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2016-06-07 18:14:38 -07:00
Nicolai Stange
8b8de1c929 ext4: silence UBSAN in ext4_mb_init()
commit 935244cd54b86ca46e69bc6604d2adfb1aec2d42 upstream.

Currently, in ext4_mb_init(), there's a loop like the following:

  do {
    ...
    offset += 1 << (sb->s_blocksize_bits - i);
    i++;
  } while (i <= sb->s_blocksize_bits + 1);

Note that the updated offset is used in the loop's next iteration only.

However, at the last iteration, that is at i == sb->s_blocksize_bits + 1,
the shift count becomes equal to (unsigned)-1 > 31 (c.f. C99 6.5.7(3))
and UBSAN reports

  UBSAN: Undefined behaviour in fs/ext4/mballoc.c:2621:15
  shift exponent 4294967295 is too large for 32-bit type 'int'
  [...]
  Call Trace:
   [<ffffffff818c4d25>] dump_stack+0xbc/0x117
   [<ffffffff818c4c69>] ? _atomic_dec_and_lock+0x169/0x169
   [<ffffffff819411ab>] ubsan_epilogue+0xd/0x4e
   [<ffffffff81941cac>] __ubsan_handle_shift_out_of_bounds+0x1fb/0x254
   [<ffffffff81941ab1>] ? __ubsan_handle_load_invalid_value+0x158/0x158
   [<ffffffff814b6dc1>] ? kmem_cache_alloc+0x101/0x390
   [<ffffffff816fc13b>] ? ext4_mb_init+0x13b/0xfd0
   [<ffffffff814293c7>] ? create_cache+0x57/0x1f0
   [<ffffffff8142948a>] ? create_cache+0x11a/0x1f0
   [<ffffffff821c2168>] ? mutex_lock+0x38/0x60
   [<ffffffff821c23ab>] ? mutex_unlock+0x1b/0x50
   [<ffffffff814c26ab>] ? put_online_mems+0x5b/0xc0
   [<ffffffff81429677>] ? kmem_cache_create+0x117/0x2c0
   [<ffffffff816fcc49>] ext4_mb_init+0xc49/0xfd0
   [...]

Observe that the mentioned shift exponent, 4294967295, equals (unsigned)-1.

Unless compilers start to do some fancy transformations (which at least
GCC 6.0.0 doesn't currently do), the issue is of cosmetic nature only: the
such calculated value of offset is never used again.

Silence UBSAN by introducing another variable, offset_incr, holding the
next increment to apply to offset and adjust that one by right shifting it
by one position per loop iteration.

Bugzilla: https://bugzilla.kernel.org/show_bug.cgi?id=114701
Bugzilla: https://bugzilla.kernel.org/show_bug.cgi?id=112161

Signed-off-by: Nicolai Stange <nicstange@gmail.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2016-06-07 18:14:37 -07:00
Nicolai Stange
12aa7d95f4 ext4: address UBSAN warning in mb_find_order_for_block()
commit b5cb316cdf3a3f5f6125412b0f6065185240cfdc upstream.

Currently, in mb_find_order_for_block(), there's a loop like the following:

  while (order <= e4b->bd_blkbits + 1) {
    ...
    bb += 1 << (e4b->bd_blkbits - order);
  }

Note that the updated bb is used in the loop's next iteration only.

However, at the last iteration, that is at order == e4b->bd_blkbits + 1,
the shift count becomes negative (c.f. C99 6.5.7(3)) and UBSAN reports

  UBSAN: Undefined behaviour in fs/ext4/mballoc.c:1281:11
  shift exponent -1 is negative
  [...]
  Call Trace:
   [<ffffffff818c4d35>] dump_stack+0xbc/0x117
   [<ffffffff818c4c79>] ? _atomic_dec_and_lock+0x169/0x169
   [<ffffffff819411bb>] ubsan_epilogue+0xd/0x4e
   [<ffffffff81941cbc>] __ubsan_handle_shift_out_of_bounds+0x1fb/0x254
   [<ffffffff81941ac1>] ? __ubsan_handle_load_invalid_value+0x158/0x158
   [<ffffffff816e93a0>] ? ext4_mb_generate_from_pa+0x590/0x590
   [<ffffffff816502c8>] ? ext4_read_block_bitmap_nowait+0x598/0xe80
   [<ffffffff816e7b7e>] mb_find_order_for_block+0x1ce/0x240
   [...]

Unless compilers start to do some fancy transformations (which at least
GCC 6.0.0 doesn't currently do), the issue is of cosmetic nature only: the
such calculated value of bb is never used again.

Silence UBSAN by introducing another variable, bb_incr, holding the next
increment to apply to bb and adjust that one by right shifting it by one
position per loop iteration.

Bugzilla: https://bugzilla.kernel.org/show_bug.cgi?id=114701
Bugzilla: https://bugzilla.kernel.org/show_bug.cgi?id=112161

Signed-off-by: Nicolai Stange <nicstange@gmail.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2016-06-07 18:14:37 -07:00
Jan Kara
b2601bb015 ext4: fix oops on corrupted filesystem
commit 74177f55b70e2f2be770dd28684dd6d17106a4ba upstream.

When filesystem is corrupted in the right way, it can happen
ext4_mark_iloc_dirty() in ext4_orphan_add() returns error and we
subsequently remove inode from the in-memory orphan list. However this
deletion is done with list_del(&EXT4_I(inode)->i_orphan) and thus we
leave i_orphan list_head with a stale content. Later we can look at this
content causing list corruption, oops, or other issues. The reported
trace looked like:

WARNING: CPU: 0 PID: 46 at lib/list_debug.c:53 __list_del_entry+0x6b/0x100()
list_del corruption, 0000000061c1d6e0->next is LIST_POISON1
0000000000100100)
CPU: 0 PID: 46 Comm: ext4.exe Not tainted 4.1.0-rc4+ #250
Stack:
 60462947 62219960 602ede24 62219960
 602ede24 603ca293 622198f0 602f02eb
 62219950 6002c12c 62219900 601b4d6b
Call Trace:
 [<6005769c>] ? vprintk_emit+0x2dc/0x5c0
 [<602ede24>] ? printk+0x0/0x94
 [<600190bc>] show_stack+0xdc/0x1a0
 [<602ede24>] ? printk+0x0/0x94
 [<602ede24>] ? printk+0x0/0x94
 [<602f02eb>] dump_stack+0x2a/0x2c
 [<6002c12c>] warn_slowpath_common+0x9c/0xf0
 [<601b4d6b>] ? __list_del_entry+0x6b/0x100
 [<6002c254>] warn_slowpath_fmt+0x94/0xa0
 [<602f4d09>] ? __mutex_lock_slowpath+0x239/0x3a0
 [<6002c1c0>] ? warn_slowpath_fmt+0x0/0xa0
 [<60023ebf>] ? set_signals+0x3f/0x50
 [<600a205a>] ? kmem_cache_free+0x10a/0x180
 [<602f4e88>] ? mutex_lock+0x18/0x30
 [<601b4d6b>] __list_del_entry+0x6b/0x100
 [<601177ec>] ext4_orphan_del+0x22c/0x2f0
 [<6012f27c>] ? __ext4_journal_start_sb+0x2c/0xa0
 [<6010b973>] ? ext4_truncate+0x383/0x390
 [<6010bc8b>] ext4_write_begin+0x30b/0x4b0
 [<6001bb50>] ? copy_from_user+0x0/0xb0
 [<601aa840>] ? iov_iter_fault_in_readable+0xa0/0xc0
 [<60072c4f>] generic_perform_write+0xaf/0x1e0
 [<600c4166>] ? file_update_time+0x46/0x110
 [<60072f0f>] __generic_file_write_iter+0x18f/0x1b0
 [<6010030f>] ext4_file_write_iter+0x15f/0x470
 [<60094e10>] ? unlink_file_vma+0x0/0x70
 [<6009b180>] ? unlink_anon_vmas+0x0/0x260
 [<6008f169>] ? free_pgtables+0xb9/0x100
 [<600a6030>] __vfs_write+0xb0/0x130
 [<600a61d5>] vfs_write+0xa5/0x170
 [<600a63d6>] SyS_write+0x56/0xe0
 [<6029fcb0>] ? __libc_waitpid+0x0/0xa0
 [<6001b698>] handle_syscall+0x68/0x90
 [<6002633d>] userspace+0x4fd/0x600
 [<6002274f>] ? save_registers+0x1f/0x40
 [<60028bd7>] ? arch_prctl+0x177/0x1b0
 [<60017bd5>] fork_handler+0x85/0x90

Fix the problem by using list_del_init() as we always should with
i_orphan list.

Reported-by: Vegard Nossum <vegard.nossum@oracle.com>
Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2016-06-07 18:14:37 -07:00
Theodore Ts'o
b2044c3f83 ext4: clean up error handling when orphan list is corrupted
commit 7827a7f6ebfcb7f388dc47fddd48567a314701ba upstream.

Instead of just printing warning messages, if the orphan list is
corrupted, declare the file system is corrupted.  If there are any
reserved inodes in the orphaned inode list, declare the file system
corrupted and stop right away to avoid doing more potential damage to
the file system.

Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2016-06-07 18:14:37 -07:00
Theodore Ts'o
c5ce389844 ext4: fix hang when processing corrupted orphaned inode list
commit c9eb13a9105e2e418f72e46a2b6da3f49e696902 upstream.

If the orphaned inode list contains inode #5, ext4_iget() returns a
bad inode (since the bootloader inode should never be referenced
directly).  Because of the bad inode, we end up processing the inode
repeatedly and this hangs the machine.

This can be reproduced via:

   mke2fs -t ext4 /tmp/foo.img 100
   debugfs -w -R "ssv last_orphan 5" /tmp/foo.img
   mount -o loop /tmp/foo.img /mnt

(But don't do this if you are using an unpatched kernel if you care
about the system staying functional.  :-)

This bug was found by the port of American Fuzzy Lop into the kernel
to find file system problems[1].  (Since it *only* happens if inode #5
shows up on the orphan list --- 3, 7, 8, etc. won't do it, it's not
surprising that AFL needed two hours before it found it.)

[1] http://events.linuxfoundation.org/sites/events/files/slides/AFL%20filesystem%20fuzzing%2C%20Vault%202016_0.pdf

Reported by: Vegard Nossum <vegard.nossum@oracle.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2016-06-07 18:14:37 -07:00
Willy Tarreau
fa6d0ba12a pipe: limit the per-user amount of pages allocated in pipes
commit 759c01142a5d0f364a462346168a56de28a80f52 upstream.

On no-so-small systems, it is possible for a single process to cause an
OOM condition by filling large pipes with data that are never read. A
typical process filling 4000 pipes with 1 MB of data will use 4 GB of
memory. On small systems it may be tricky to set the pipe max size to
prevent this from happening.

This patch makes it possible to enforce a per-user soft limit above
which new pipes will be limited to a single page, effectively limiting
them to 4 kB each, as well as a hard limit above which no new pipes may
be created for this user. This has the effect of protecting the system
against memory abuse without hurting other users, and still allowing
pipes to work correctly though with less data at once.

The limit are controlled by two new sysctls : pipe-user-pages-soft, and
pipe-user-pages-hard. Both may be disabled by setting them to zero. The
default soft limit allows the default number of FDs per process (1024)
to create pipes of the default size (64kB), thus reaching a limit of 64MB
before starting to create only smaller pipes. With 256 processes limited
to 1024 FDs each, this results in 1024*64kB + (256*1024 - 1024) * 4kB =
1084 MB of memory allocated for a user. The hard limit is disabled by
default to avoid breaking existing applications that make intensive use
of pipes (eg: for splicing).

Reported-by: socketpair@gmail.com
Reported-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
Mitigates: CVE-2013-4312 (Linux 2.0+)
Suggested-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Willy Tarreau <w@1wt.eu>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Cc: Moritz Muehlenhoff <moritz@wikimedia.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2016-06-07 18:14:35 -07:00
Mikulas Patocka
91bb3cf478 affs: fix remount failure when there are no options changed
commit 01d6e08711bf90bc4d7ead14a93a0cbd73b1896a upstream.

Commit c8f33d0bec ("affs: kstrdup() memory handling") checks if the
kstrdup function returns NULL due to out-of-memory condition.

However, if we are remounting a filesystem with no change to
filesystem-specific options, the parameter data is NULL.  In this case,
kstrdup returns NULL (because it was passed NULL parameter), although no
out of memory condition exists.  The mount syscall then fails with
ENOMEM.

This patch fixes the bug.  We fail with ENOMEM only if data is non-NULL.

The patch also changes the call to replace_mount_options - if we didn't
pass any filesystem-specific options, we don't call
replace_mount_options (thus we don't erase existing reported options).

Fixes: c8f33d0bec ("affs: kstrdup() memory handling")
Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2016-06-07 18:14:32 -07:00
Alex Shi
cf5ba83bb2 Merge branch 'lsk-v4.4-android' of git://android.git.linaro.org/kernel/linaro-android into linux-linaro-lsk-v4.4-android 2016-06-02 17:59:02 +08:00
Alex Shi
58189909e6 Merge branch 'linux-linaro-lsk-v4.4' into linux-linaro-lsk-v4.4-android 2016-06-02 12:18:57 +08:00
Laura Abbott
a2ac6d40d3 fs/buffer.c: Revoke LRU when trying to drop buffers
When a buffer is added to the LRU list, a reference is taken which is
not dropped until the buffer is evicted from the LRU list. This is the
correct behavior, however this LRU reference will prevent the buffer
from being dropped. This means that the buffer can't actually be dropped
until it is selected for eviction. There's no bound on the time spent
on the LRU list, which means that the buffer may be undroppable for
very long periods of time. Given that migration involves dropping
buffers, the associated page is now unmigratible for long periods of
time as well. CMA relies on being able to migrate a specific range
of pages, so these these types of failures make CMA significantly
less reliable, especially under high filesystem usage.

Rather than waiting for the LRU algorithm to eventually kick out
the buffer, explicitly remove the buffer from the LRU list when trying
to drop it. There is still the possibility that the buffer
could be added back on the list, but that indicates the buffer is
still in use and would probably have other 'in use' indicates to
prevent dropping.

Change-Id: I253f4ee2069e190c1115afc421dadd27a7fa87dc
Signed-off-by: Laura Abbott <lauraa@codeaurora.org>
Signed-off-by: Vinayak Menon <vinmenon@codeaurora.org>
2016-06-01 15:20:51 -07:00
Mikulas Patocka
7e920411dd hpfs: implement the show_options method
commit 037369b872940cd923835a0a589763180c4a36bc upstream.

The HPFS filesystem used generic_show_options to produce string that is
displayed in /proc/mounts.  However, there is a problem that the options
may disappear after remount.  If we mount the filesystem with option1
and then remount it with option2, /proc/mounts should show both option1
and option2, however it only shows option2 because the whole option
string is replaced with replace_mount_options in hpfs_remount_fs.

To fix this bug, implement the hpfs_show_options function that prints
options that are currently selected.

Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2016-06-01 12:15:54 -07:00
Mikulas Patocka
5cb3ec3d60 hpfs: fix remount failure when there are no options changed
commit 44d51706b4685f965cd32acde3fe0fcc1e6198e8 upstream.

Commit ce657611ba ("hpfs: kstrdup() out of memory handling") checks if
the kstrdup function returns NULL due to out-of-memory condition.

However, if we are remounting a filesystem with no change to
filesystem-specific options, the parameter data is NULL.  In this case,
kstrdup returns NULL (because it was passed NULL parameter), although no
out of memory condition exists.  The mount syscall then fails with
ENOMEM.

This patch fixes the bug.  We fail with ENOMEM only if data is non-NULL.

The patch also changes the call to replace_mount_options - if we didn't
pass any filesystem-specific options, we don't call
replace_mount_options (thus we don't erase existing reported options).

Fixes: ce657611ba ("hpfs: kstrdup() out of memory handling")
Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2016-06-01 12:15:54 -07:00
Stefan Metzmacher
6b83512b37 fs/cifs: correctly to anonymous authentication for the NTLM(v2) authentication
commit 1a967d6c9b39c226be1b45f13acd4d8a5ab3dc44 upstream.

Only server which map unknown users to guest will allow
access using a non-null NTLMv2_Response.

For Samba it's the "map to guest = bad user" option.

BUG: https://bugzilla.samba.org/show_bug.cgi?id=11913

Signed-off-by: Stefan Metzmacher <metze@samba.org>
Signed-off-by: Steve French <smfrench@gmail.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2016-06-01 12:15:48 -07:00
Stefan Metzmacher
0e5e5bfd9b fs/cifs: correctly to anonymous authentication for the NTLM(v1) authentication
commit 777f69b8d26bf35ade4a76b08f203c11e048365d upstream.

Only server which map unknown users to guest will allow
access using a non-null NTChallengeResponse.

For Samba it's the "map to guest = bad user" option.

BUG: https://bugzilla.samba.org/show_bug.cgi?id=11913

Signed-off-by: Stefan Metzmacher <metze@samba.org>
Signed-off-by: Steve French <smfrench@gmail.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2016-06-01 12:15:48 -07:00
Stefan Metzmacher
4dc809685c fs/cifs: correctly to anonymous authentication for the LANMAN authentication
commit fa8f3a354bb775ec586e4475bcb07f7dece97e0c upstream.

Only server which map unknown users to guest will allow
access using a non-null LMChallengeResponse.

For Samba it's the "map to guest = bad user" option.

BUG: https://bugzilla.samba.org/show_bug.cgi?id=11913

Signed-off-by: Stefan Metzmacher <metze@samba.org>
Signed-off-by: Steve French <smfrench@gmail.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2016-06-01 12:15:48 -07:00
Stefan Metzmacher
9ad66e1474 fs/cifs: correctly to anonymous authentication via NTLMSSP
commit cfda35d98298131bf38fbad3ce4cd5ecb3cf18db upstream.

See [MS-NLMP] 3.2.5.1.2 Server Receives an AUTHENTICATE_MESSAGE from the Client:

   ...
   Set NullSession to FALSE
   If (AUTHENTICATE_MESSAGE.UserNameLen == 0 AND
      AUTHENTICATE_MESSAGE.NtChallengeResponse.Length == 0 AND
      (AUTHENTICATE_MESSAGE.LmChallengeResponse == Z(1)
       OR
       AUTHENTICATE_MESSAGE.LmChallengeResponse.Length == 0))
       -- Special case: client requested anonymous authentication
       Set NullSession to TRUE
   ...

Only server which map unknown users to guest will allow
access using a non-null NTChallengeResponse.

For Samba it's the "map to guest = bad user" option.

BUG: https://bugzilla.samba.org/show_bug.cgi?id=11913

Signed-off-by: Stefan Metzmacher <metze@samba.org>
Signed-off-by: Steve French <smfrench@gmail.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2016-06-01 12:15:48 -07:00
Steve French
b7d7ba3115 remove directory incorrectly tries to set delete on close on non-empty directories
commit 897fba1172d637d344f009d700f7eb8a1fa262f1 upstream.

Wrong return code was being returned on SMB3 rmdir of
non-empty directory.

For SMB3 (unlike for cifs), we attempt to delete a directory by
set of delete on close flag on the open. Windows clients set
this flag via a set info (SET_FILE_DISPOSITION to set this flag)
which properly checks if the directory is empty.

With this patch on smb3 mounts we correctly return
 "DIRECTORY NOT EMPTY"
on attempts to remove a non-empty directory.

Signed-off-by: Steve French <steve.french@primarydata.com>
Acked-by: Sachin Prabhu <sprabhu@redhat.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2016-06-01 12:15:48 -07:00
Eryu Guan
fa5613b1f3 ext4: iterate over buffer heads correctly in move_extent_per_page()
commit 6ffe77bad545f4a7c8edd2a4ee797ccfcd894ab4 upstream.

In commit bcff24887d00 ("ext4: don't read blocks from disk after extents
being swapped") bh is not updated correctly in the for loop and wrong
data has been written to disk. generic/324 catches this on sub-page
block size ext4.

Fixes: bcff24887d00 ("ext4: don't read blocks from disk after extentsbeing swapped")
Signed-off-by: Eryu Guan <guaneryu@gmail.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2016-06-01 12:15:48 -07:00
Josef Bacik
e1ce8c2a29 Btrfs: don't use src fd for printk
commit c79b4713304f812d3d6c95826fc3e5fc2c0b0c14 upstream.

The fd we pass in may not be on a btrfs file system, so don't try to do
BTRFS_I() on it.  Thanks,

Signed-off-by: Josef Bacik <jbacik@fb.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Cc: Jeff Mahoney <jeffm@suse.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2016-06-01 12:15:47 -07:00
Alex Shi
023861726f Merge branch 'linux-linaro-lsk-v4.4' into linux-linaro-lsk-v4.4-android 2016-05-20 12:16:40 +08:00
Janis Danisevskis
1a6976bead UPSTREAM: procfs: fixes pthread cross-thread naming if !PR_DUMPABLE
The PR_DUMPABLE flag causes the pid related paths of the
proc file system to be owned by ROOT. The implementation
of pthread_set/getname_np however needs access to
/proc/<pid>/task/<tid>/comm.
If PR_DUMPABLE is false this implementation is locked out.

This patch installs a special permission function for
the file "comm" that grants read and write access to
all threads of the same group regardless of the ownership
of the inode. For all other threads the function falls back
to the generic inode permission check.

Signed-off-by: Janis Danisevskis <jdanis@google.com>
2016-05-19 12:35:13 +05:30
Daniel Rosenberg
28813390bf fuse: Add support for d_canonical_path
Allows FUSE to report to inotify that it is acting
as a layered filesystem. The userspace component
returns a string representing the location of the
underlying file. If the string cannot be resolved
into a path, the top level path is returned instead.

bug: 23904372
Change-Id: Iabdca0bbedfbff59e9c820c58636a68ef9683d9f
Signed-off-by: Daniel Rosenberg <drosen@google.com>
2016-05-19 12:35:13 +05:30
Daniel Rosenberg
6fa18ffc71 vfs: change d_canonical_path to take two paths
bug: 23904372
Change-Id: I4a686d64b6de37decf60019be1718e1d820193e6
Signed-off-by: Daniel Rosenberg <drosen@google.com>
2016-05-19 12:35:13 +05:30
Al Viro
007796c01f get_rock_ridge_filename(): handle malformed NM entries
commit 99d825822eade8d827a1817357cbf3f889a552d6 upstream.

Payloads of NM entries are not supposed to contain NUL.  When we run
into such, only the part prior to the first NUL goes into the
concatenation (i.e. the directory entry name being encoded by a bunch
of NM entries).  We do stop when the amount collected so far + the
claimed amount in the current NM entry exceed 254.  So far, so good,
but what we return as the total length is the sum of *claimed*
sizes, not the actual amount collected.  And that can grow pretty
large - not unlimited, since you'd need to put CE entries in
between to be able to get more than the maximum that could be
contained in one isofs directory entry / continuation chunk and
we are stop once we'd encountered 32 CEs, but you can get about 8Kb
easily.  And that's what will be passed to readdir callback as the
name length.  8Kb __copy_to_user() from a buffer allocated by
__get_free_page()

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2016-05-18 17:06:54 -07:00
Al Viro
4549fc7128 atomic_open(): fix the handling of create_error
commit 10c64cea04d3c75c306b3f990586ffb343b63287 upstream.

* if we have a hashed negative dentry and either CREAT|EXCL on
r/o filesystem, or CREAT|TRUNC on r/o filesystem, or CREAT|EXCL
with failing may_o_create(), we should fail with EROFS or the
error may_o_create() has returned, but not ENOENT.  Which is what
the current code ends up returning.

* if we have CREAT|TRUNC hitting a regular file on a read-only
filesystem, we can't fail with EROFS here.  At the very least,
not until we'd done follow_managed() - we might have a writable
file (or a device, for that matter) bound on top of that one.
Moreover, the code downstream will see that O_TRUNC and attempt
to grab the write access (*after* following possible mount), so
if we really should fail with EROFS, it will happen.  No need
to do that inside atomic_open().

The real logics is much simpler than what the current code is
trying to do - if we decided to go for simple lookup, ended
up with a negative dentry *and* had create_error set, fail with
create_error.  No matter whether we'd got that negative dentry
from lookup_real() or had found it in dcache.

Acked-by: Miklos Szeredi <mszeredi@redhat.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2016-05-18 17:06:51 -07:00
Miklos Szeredi
8e5bb3c541 vfs: rename: check backing inode being equal
commit 9409e22acdfc9153f88d9b1ed2bd2a5b34d2d3ca upstream.

If a file is renamed to a hardlink of itself POSIX specifies that rename(2)
should do nothing and return success.

This condition is checked in vfs_rename().  However it won't detect hard
links on overlayfs where these are given separate inodes on the overlayfs
layer.

Overlayfs itself detects this condition and returns success without doing
anything, but then vfs_rename() will proceed as if this was a successful
rename (detach_mounts(), d_move()).

The correct thing to do is to detect this condition before even calling
into overlayfs.  This patch does this by calling vfs_select_inode() to get
the underlying inodes.

Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2016-05-18 17:06:49 -07:00
Miklos Szeredi
b0dac61d24 vfs: add vfs_select_inode() helper
commit 54d5ca871e72f2bb172ec9323497f01cd5091ec7 upstream.

Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2016-05-18 17:06:48 -07:00
Junxiao Bi
dc3e6de00b ocfs2: fix posix_acl_create deadlock
commit c25a1e0671fbca7b2c0d0757d533bd2650d6dc0c upstream.

Commit 702e5bc68a ("ocfs2: use generic posix ACL infrastructure")
refactored code to use posix_acl_create.  The problem with this function
is that it is not mindful of the cluster wide inode lock making it
unsuitable for use with ocfs2 inode creation with ACLs.  For example,
when used in ocfs2_mknod, this function can cause deadlock as follows.
The parent dir inode lock is taken when calling posix_acl_create ->
get_acl -> ocfs2_iop_get_acl which takes the inode lock again.  This can
cause deadlock if there is a blocked remote lock request waiting for the
lock to be downconverted.  And same deadlock happened in ocfs2_reflink.
This fix is to revert back using ocfs2_init_acl.

Fixes: 702e5bc68a ("ocfs2: use generic posix ACL infrastructure")
Signed-off-by: Tariq Saeed <tariq.x.saeed@oracle.com>
Signed-off-by: Junxiao Bi <junxiao.bi@oracle.com>
Cc: Mark Fasheh <mfasheh@suse.de>
Cc: Joel Becker <jlbec@evilplan.org>
Cc: Joseph Qi <joseph.qi@huawei.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2016-05-18 17:06:44 -07:00
Junxiao Bi
3cbabd4e83 ocfs2: revert using ocfs2_acl_chmod to avoid inode cluster lock hang
commit 5ee0fbd50fdf1c1329de8bee35ea9d7c6a81a2e0 upstream.

Commit 743b5f1434 ("ocfs2: take inode lock in ocfs2_iop_set/get_acl()")
introduced this issue.  ocfs2_setattr called by chmod command holds
cluster wide inode lock when calling posix_acl_chmod.  This latter
function in turn calls ocfs2_iop_get_acl and ocfs2_iop_set_acl.  These
two are also called directly from vfs layer for getfacl/setfacl commands
and therefore acquire the cluster wide inode lock.  If a remote
conversion request comes after the first inode lock in ocfs2_setattr,
OCFS2_LOCK_BLOCKED will be set.  And this will cause the second call to
inode lock from the ocfs2_iop_get_acl() to block indefinetly.

The deleted version of ocfs2_acl_chmod() calls __posix_acl_chmod() which
does not call back into the filesystem.  Therefore, we restore
ocfs2_acl_chmod(), modify it slightly for locking as needed, and use that
instead.

Fixes: 743b5f1434 ("ocfs2: take inode lock in ocfs2_iop_set/get_acl()")
Signed-off-by: Tariq Saeed <tariq.x.saeed@oracle.com>
Signed-off-by: Junxiao Bi <junxiao.bi@oracle.com>
Cc: Mark Fasheh <mfasheh@suse.de>
Cc: Joel Becker <jlbec@evilplan.org>
Cc: Joseph Qi <joseph.qi@huawei.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2016-05-18 17:06:43 -07:00
Alex Shi
b3f09bff3f Merge branch 'linux-linaro-lsk-v4.4' into linux-linaro-lsk-v4.4-android 2016-05-12 12:20:40 +08:00
Alex Shi
334ca3ed18 Merge branch 'linux-linaro-lsk-v4.4' into linux-linaro-lsk-v4.4-android 2016-05-12 09:27:18 +08:00
Eric W. Biederman
b17580a3cb propogate_mnt: Handle the first propogated copy being a slave
commit 5ec0811d30378ae104f250bfc9b3640242d81e3f upstream.

When the first propgated copy was a slave the following oops would result:
> BUG: unable to handle kernel NULL pointer dereference at 0000000000000010
> IP: [<ffffffff811fba4e>] propagate_one+0xbe/0x1c0
> PGD bacd4067 PUD bac66067 PMD 0
> Oops: 0000 [#1] SMP
> Modules linked in:
> CPU: 1 PID: 824 Comm: mount Not tainted 4.6.0-rc5userns+ #1523
> Hardware name: Bochs Bochs, BIOS Bochs 01/01/2007
> task: ffff8800bb0a8000 ti: ffff8800bac3c000 task.ti: ffff8800bac3c000
> RIP: 0010:[<ffffffff811fba4e>]  [<ffffffff811fba4e>] propagate_one+0xbe/0x1c0
> RSP: 0018:ffff8800bac3fd38  EFLAGS: 00010283
> RAX: 0000000000000000 RBX: ffff8800bb77ec00 RCX: 0000000000000010
> RDX: 0000000000000000 RSI: ffff8800bb58c000 RDI: ffff8800bb58c480
> RBP: ffff8800bac3fd48 R08: 0000000000000001 R09: 0000000000000000
> R10: 0000000000001ca1 R11: 0000000000001c9d R12: 0000000000000000
> R13: ffff8800ba713800 R14: ffff8800bac3fda0 R15: ffff8800bb77ec00
> FS:  00007f3c0cd9b7e0(0000) GS:ffff8800bfb00000(0000) knlGS:0000000000000000
> CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> CR2: 0000000000000010 CR3: 00000000bb79d000 CR4: 00000000000006e0
> Stack:
>  ffff8800bb77ec00 0000000000000000 ffff8800bac3fd88 ffffffff811fbf85
>  ffff8800bac3fd98 ffff8800bb77f080 ffff8800ba713800 ffff8800bb262b40
>  0000000000000000 0000000000000000 ffff8800bac3fdd8 ffffffff811f1da0
> Call Trace:
>  [<ffffffff811fbf85>] propagate_mnt+0x105/0x140
>  [<ffffffff811f1da0>] attach_recursive_mnt+0x120/0x1e0
>  [<ffffffff811f1ec3>] graft_tree+0x63/0x70
>  [<ffffffff811f1f6b>] do_add_mount+0x9b/0x100
>  [<ffffffff811f2c1a>] do_mount+0x2aa/0xdf0
>  [<ffffffff8117efbe>] ? strndup_user+0x4e/0x70
>  [<ffffffff811f3a45>] SyS_mount+0x75/0xc0
>  [<ffffffff8100242b>] do_syscall_64+0x4b/0xa0
>  [<ffffffff81988f3c>] entry_SYSCALL64_slow_path+0x25/0x25
> Code: 00 00 75 ec 48 89 0d 02 22 22 01 8b 89 10 01 00 00 48 89 05 fd 21 22 01 39 8e 10 01 00 00 0f 84 e0 00 00 00 48 8b 80 d8 00 00 00 <48> 8b 50 10 48 89 05 df 21 22 01 48 89 15 d0 21 22 01 8b 53 30
> RIP  [<ffffffff811fba4e>] propagate_one+0xbe/0x1c0
>  RSP <ffff8800bac3fd38>
> CR2: 0000000000000010
> ---[ end trace 2725ecd95164f217 ]---

This oops happens with the namespace_sem held and can be triggered by
non-root users.  An all around not pleasant experience.

To avoid this scenario when finding the appropriate source mount to
copy stop the walk up the mnt_master chain when the first source mount
is encountered.

Further rewrite the walk up the last_source mnt_master chain so that
it is clear what is going on.

The reason why the first source mount is special is that it it's
mnt_parent is not a mount in the dest_mnt propagation tree, and as
such termination conditions based up on the dest_mnt mount propgation
tree do not make sense.

To avoid other kinds of confusion last_dest is not changed when
computing last_source.  last_dest is only used once in propagate_one
and that is above the point of the code being modified, so changing
the global variable is meaningless and confusing.

fixes: f2ebb3a921 ("smarter propagate_mnt()")
Reported-by: Tycho Andersen <tycho.andersen@canonical.com>
Reviewed-by: Seth Forshee <seth.forshee@canonical.com>
Tested-by: Seth Forshee <seth.forshee@canonical.com>
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2016-05-11 11:21:19 +02:00
Maxim Patlasov
ddd5c3139d fs/pnode.c: treat zero mnt_group_id-s as unequal
commit 7ae8fd0351f912b075149a1e03a017be8b903b9a upstream.

propagate_one(m) calculates "type" argument for copy_tree() like this:

>    if (m->mnt_group_id == last_dest->mnt_group_id) {
>        type = CL_MAKE_SHARED;
>    } else {
>        type = CL_SLAVE;
>        if (IS_MNT_SHARED(m))
>           type |= CL_MAKE_SHARED;
>   }

The "type" argument then governs clone_mnt() behavior with respect to flags
and mnt_master of new mount. When we iterate through a slave group, it is
possible that both current "m" and "last_dest" are not shared (although,
both are slaves, i.e. have non-NULL mnt_master-s). Then the comparison
above erroneously makes new mount shared and sets its mnt_master to
last_source->mnt_master. The patch fixes the problem by handling zero
mnt_group_id-s as though they are unequal.

The similar problem exists in the implementation of "else" clause above
when we have to ascend upward in the master/slave tree by calling:

>    last_source = last_source->mnt_master;
>    last_dest = last_source->mnt_parent;

proper number of times. The last step is governed by
"n->mnt_group_id != last_dest->mnt_group_id" condition that may lie if
both are zero. The patch fixes this case in the same way as the former one.

[AV: don't open-code an obvious helper...]

Signed-off-by: Maxim Patlasov <mpatlasov@virtuozzo.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Cc: Seth Forshee <seth.forshee@canonical.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2016-05-11 11:21:19 +02:00
Mathias Krause
898149d10b proc: prevent accessing /proc/<PID>/environ until it's ready
commit 8148a73c9901a8794a50f950083c00ccf97d43b3 upstream.

If /proc/<PID>/environ gets read before the envp[] array is fully set up
in create_{aout,elf,elf_fdpic,flat}_tables(), we might end up trying to
read more bytes than are actually written, as env_start will already be
set but env_end will still be zero, making the range calculation
underflow, allowing to read beyond the end of what has been written.

Fix this as it is done for /proc/<PID>/cmdline by testing env_end for
zero.  It is, apparently, intentionally set last in create_*_tables().

This bug was found by the PaX size_overflow plugin that detected the
arithmetic underflow of 'this_len = env_end - (env_start + src)' when
env_end is still zero.

The expected consequence is that userland trying to access
/proc/<PID>/environ of a not yet fully set up process may get
inconsistent data as we're in the middle of copying in the environment
variables.

Fixes: https://forums.grsecurity.net/viewtopic.php?f=3&t=4363
Fixes: https://bugzilla.kernel.org/show_bug.cgi?id=116461
Signed-off-by: Mathias Krause <minipli@googlemail.com>
Cc: Emese Revfy <re.emese@gmail.com>
Cc: Pax Team <pageexec@freemail.hu>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Mateusz Guzik <mguzik@redhat.com>
Cc: Alexey Dobriyan <adobriyan@gmail.com>
Cc: Cyrill Gorcunov <gorcunov@openvz.org>
Cc: Jarod Wilson <jarod@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2016-05-11 11:21:16 +02:00