From: Yigal Korman <yigal@plexistor.com>
[v1]
Without this patch, c/mtime is not updated correctly when mmap'ed page is
first read from and then written to.
A new xfstest is submitted for testing this (generic/080)
[v2]
Jan Kara has pointed out that if we add the
sb_start/end_pagefault pair in the new pfn_mkwrite we
are then fixing another bug where: A user could start
writing to the page while filesystem is frozen.
Signed-off-by: Yigal Korman <yigal@plexistor.com>
Signed-off-by: Boaz Harrosh <boaz@plexistor.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Cc: Matthew Wilcox <matthew.r.wilcox@intel.com>
Cc: Dave Chinner <david@fromorbit.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
that's the bulk of filesystem drivers dealing with inodes of their own
Signed-off-by: David Howells <dhowells@redhat.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Modifies htree_dirblock_to_tree, dx_make_map, ext4_match search_dir,
and ext4_find_dest_de to support fname crypto. Filename encryption
feature is not yet enabled at this patch.
Signed-off-by: Uday Savagaonkar <savagaon@google.com>
Signed-off-by: Ildar Muslukhov <ildarm@google.com>
Signed-off-by: Michael Halcrow <mhalcrow@google.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
For encrypted directories, we need to pass in a separate parameter for
the decrypted filename, since the directory entry contains the
encrypted filename.
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Pulls block_write_begin() into fs/ext4/inode.c because it might need
to do a low-level read of the existing data, in which case we need to
decrypt it.
Signed-off-by: Michael Halcrow <mhalcrow@google.com>
Signed-off-by: Ildar Muslukhov <ildarm@google.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Enforce the following inheritance policy:
1) An unencrypted directory may contain encrypted or unencrypted files
or directories.
2) All files or directories in a directory must be protected using the
same key as their containing directory.
As a result, assuming the following setup:
mke2fs -t ext4 -Fq -O encrypt /dev/vdc
mount -t ext4 /dev/vdc /vdc
mkdir /vdc/a /vdc/b /vdc/c
echo foo | e4crypt add_key /vdc/a
echo bar | e4crypt add_key /vdc/b
for i in a b c ; do cp /etc/motd /vdc/$i/motd-$i ; done
Then we will see the following results:
cd /vdc
mv a b # will fail; /vdc/a and /vdc/b have different keys
mv b/motd-b a # will fail, see above
ln a/motd-a b # will fail, see above
mv c a # will fail; all inodes in an encrypted directory
# must be encrypted
ln c/motd-c b # will fail, see above
mv a/motd-a c # will succeed
mv c/motd-a a # will succeed
Signed-off-by: Michael Halcrow <mhalcrow@google.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
On encrypt, we will re-assign the buffer_heads to point to a bounce
page rather than the control_page (which is the original page to write
that contains the plaintext). The block I/O occurs against the bounce
page. On write completion, we re-assign the buffer_heads to the
original plaintext page.
On decrypt, we will attach a read completion callback to the bio
struct. This read completion will decrypt the read contents in-place
prior to setting the page up-to-date.
The current encryption mode, AES-256-XTS, lacks cryptographic
integrity. AES-256-GCM is in-plan, but we will need to devise a
mechanism for handling the integrity data.
Signed-off-by: Michael Halcrow <mhalcrow@google.com>
Signed-off-by: Ildar Muslukhov <ildarm@google.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
... returning -E... upon error and amount of data left in iter after
(possible) truncation upon success. Note, that normal case gives
a non-zero (positive) return value, so any tests for != 0 _must_ be
updated.
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Conflicts:
fs/ext4/file.c
The rw parameter to direct_IO is redundant with iov_iter->type, and
treated slightly differently just about everywhere it's used: some users
do rw & WRITE, and others do rw == WRITE where they should be doing a
bitwise check. Simplify this with the new iov_iter_rw() helper, which
always returns either READ or WRITE.
Signed-off-by: Omar Sandoval <osandov@osandov.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Most filesystems call through to these at some point, so we'll start
here.
Signed-off-by: Omar Sandoval <osandov@osandov.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
All places outside of core VFS that checked ->read and ->write for being NULL or
called the methods directly are gone now, so NULL {read,write} with non-NULL
{read,write}_iter will do the right thing in all cases.
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
This takes code from fs/mpage.c and optimizes it for ext4. Its
primary reason is to allow us to more easily add encryption to ext4's
read path in an efficient manner.
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Previously commit 14ece1028b added a
support for for syncing parent directory of newly created inodes to
make sure that the inode is not lost after a power failure in
no-journal mode.
However this does not work in majority of cases, namely:
- if the directory has inline data
- if the directory is already indexed
- if the directory already has at least one block and:
- the new entry fits into it
- or we've successfully converted it to indexed
So in those cases we might lose the inode entirely even after fsync in
the no-journal mode. This also includes ext2 default mode obviously.
I've noticed this while running xfstest generic/321 and even though the
test should fail (we need to run fsck after a crash in no-journal mode)
I could not find a newly created entries even when if it was fsynced
before.
Fix this by adjusting the ext4_add_entry() successful exit paths to set
the inode EXT4_STATE_NEWENTRY so that fsync has the chance to fsync the
parent directory as well.
Signed-off-by: Lukas Czerner <lczerner@redhat.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Reviewed-by: Jan Kara <jack@suse.cz>
Cc: Frank Mayhar <fmayhar@google.com>
Cc: stable@vger.kernel.org
When xfstests' auto group is run on a bigalloc filesystem with a
4.0-rc3 kernel, e2fsck failures and kernel warnings occur for some
tests. e2fsck reports incorrect iblocks values, and the warnings
indicate that the space reserved for delayed allocation is being
overdrawn at allocation time.
Some of these errors occur because the reserved space is incorrectly
decreased by one cluster when ext4_ext_map_blocks satisfies an
allocation request by mapping an unused portion of a previously
allocated cluster. Because a cluster's worth of reserved space was
already released when it was first allocated, it should not be released
again.
This patch appears to correct the e2fsck failure reported for
generic/232 and the kernel warnings produced by ext4/001, generic/009,
and generic/033. Failures and warnings for some other tests remain to
be addressed.
Signed-off-by: Eric Whitney <enwlinux@gmail.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
In ext4_zero_range(), removing a file's entire block range from the
extent status tree removes all records of that file's delalloc extents.
The delalloc accounting code uses this information, and its loss can
then lead to accounting errors and kernel warnings at writeback time and
subsequent file system damage. This is most noticeable on bigalloc
file systems where code in ext4_ext_map_blocks() handles cases where
delalloc extents share clusters with a newly allocated extent.
Because we're not deleting a block range and are correctly updating the
status of its associated extent, there is no need to remove anything
from the extent status tree.
When this patch is combined with an unrelated bug fix for
ext4_zero_range(), kernel warnings and e2fsck errors reported during
xfstests runs on bigalloc filesystems are greatly reduced without
introducing regressions on other xfstests-bld test scenarios.
Signed-off-by: Eric Whitney <enwlinux@gmail.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Currently there is a bug in zero range code which causes zero range
calls to only allocate block aligned portion of the range, while
ignoring the rest in some cases.
In some cases, namely if the end of the range is past i_size, we do
attempt to preallocate the last nonaligned block. However this might
cause kernel to BUG() in some carefully designed zero range requests
on setups where page size > block size.
Fix this problem by first preallocating the entire range, including
the nonaligned edges and converting the written extents to unwritten
in the next step. This approach will also give us the advantage of
having the range to be as linearly contiguous as possible.
Signed-off-by: Lukas Czerner <lczerner@redhat.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
This is a leftover of commit 71d4f7d032
Signed-off-by: Maurizio Lombardi <mlombard@redhat.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Reviewed-by: Lukas Czerner <lczerner@redhat.com>
bdi->dev now never goes away, so this function became useless.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
In this if statement, the previous condition is useless, the later one
has covered it.
Signed-off-by: Weiyuan <weiyuan.wei@huawei.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Reviewed-by: Lukas Czerner <lczerner@redhat.com>
Remove unused header files and header files which are included in
ext4.h.
Signed-off-by: Sheng Yong <shengyong1@huawei.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Since commit a9b8241594, we are allowed to merge unwritten extents,
so here these comments are wrong, remove it.
Signed-off-by: Xiaoguang Wang <wangxg.fnst@cn.fujitsu.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
According to C99, %*.s means the same as %*.0s, in other words, print as
many spaces as the field width argument says and effectively ignore the
string argument. That is certainly not what was meant here. The kernel's
printf implementation, however, treats it as if the . was not there,
i.e. as %*s. I don't know if de->name is nul-terminated or not, but in
any case I'm guessing the intention was to use de->name_len as precision
instead of field width.
[ Note: this is debugging code which is commented out, so this is not
security issue; a developer would have to explicitly enable
INLINE_DIR_DEBUG before this would be an issue. ]
Signed-off-by: Rasmus Villemoes <linux@rasmusvillemoes.dk>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Release references to buffer-heads if ext4_journal_start() fails.
Fixes: 5b61de7575 ("ext4: start handle at least possible moment when renaming files")
Signed-off-by: Konstantin Khlebnikov <khlebnikov@yandex-team.ru>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Reviewed-by: Jan Kara <jack@suse.cz>
struct kiocb now is a generic I/O container, so move it to fs.h.
Also do a #include diet for aio.h while we're at it.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Create new internal interface for getting information about quota which
contains everything needed for both VFS quotas and XFS quotas. Make VFS
use this and hook it up to Q_GETINFO.
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jan Kara <jack@suse.cz>
and read-only images (for which the implementation is mostly just the
reserved code point for a read-only feature :-)
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v2
iQIcBAABCAAGBQJU6lssAAoJENNvdpvBGATwpsEQAOcpCqj0gp/istbsoFpl5v5K
+BU2aPvR5CPLtUQz9MqrVF5/6zwDbHGN+GIB6CEmh/qHIVQAhhS4XR+opSc7qqUr
fAQ1AhL5Oh8Dyn9DRy5Io8oRv+wo5lRdD7aG7SPiizCMRQ34JwJ2sWIAwbP2Ea7W
Xg51v3LWEu+UpqpgY3YWBoJKHj4hXwFvTVOCHs94239Y2zlcg2c4WwbKPzkvPcV/
TvvZOOctty+l3FOB2bqFj3VnvywQmNv8/OixKjSprxlR7nuQlhKaLTWCtRjFbND4
J/rk2ls5Bl79dnMvyVfV5ghpmGYBf5kkXCP716YsQkRCZUfNVrTOPJrNHZtYilAb
opRo2UjAyTWxZBvyssnCorHJZUdxlYeIuSTpaG0zUbR0Y6p/7qd31F5k41GbBCFf
B0lV3IaiVnXk23S2jFVHGhrzoKdFqu30tY7LMaO4xyGVMigOZJyBu8TZ7Utj9HmW
/4GfjlvYqlfB7p+6yBkDv/87hjdmfMWIw48A7xWCiIeguQhB79gwTV7uAHVtgfng
h5RF2EH/fx5klbAZx9vlaAh3pGFBHbh9fkeBmW9qNm7glz7aMUuxQaSo6X8HrCAJ
LrECgDGbuiOHnMYuzZRERZiqwLB7JT82C1xopGzefsE/i0kN1eMjITkfggjQ5whu
caLPn49tAb9U8P6TsPeE
=PF+t
-----END PGP SIGNATURE-----
Merge tag 'ext4_for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4
Pull ext4 fixes from Ted Ts'o:
"Ext4 bug fixes.
We also reserved code points for encryption and read-only images (for
which the implementation is mostly just the reserved code point for a
read-only feature :-)"
* tag 'ext4_for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4:
ext4: fix indirect punch hole corruption
ext4: ignore journal checksum on remount; don't fail
ext4: remove duplicate remount check for JOURNAL_CHECKSUM change
ext4: fix mmap data corruption in nodelalloc mode when blocksize < pagesize
ext4: support read-only images
ext4: change to use setup_timer() instead of init_timer()
ext4: reserve codepoints used by the ext4 encryption feature
jbd2: complain about descriptor block checksum errors
Pull lazytime mount option support from Al Viro:
"Lazytime stuff from tytso"
* 'lazytime' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
ext4: add optimization for the lazytime mount option
vfs: add find_inode_nowait() function
vfs: add support for a lazytime mount option
This is a port of the DAX functionality found in the current version of
ext2.
[matthew.r.wilcox@intel.com: heavily tweaked]
[akpm@linux-foundation.org: remap_pages went away]
Signed-off-by: Ross Zwisler <ross.zwisler@linux.intel.com>
Reviewed-by: Andreas Dilger <andreas.dilger@intel.com>
Signed-off-by: Matthew Wilcox <matthew.r.wilcox@intel.com>
Cc: Boaz Harrosh <boaz@plexistor.com>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Dave Chinner <david@fromorbit.com>
Cc: Jan Kara <jack@suse.cz>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Cc: Randy Dunlap <rdunlap@infradead.org>
Cc: Theodore Ts'o <tytso@mit.edu>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Commit 4f579ae7de (ext4: fix punch hole on files with indirect
mapping) rewrote FALLOC_FL_PUNCH_HOLE for ext4 files with indirect
mapping. However, there are bugs in several corner cases. This fixes 5
distinct bugs:
1. When there is at least one entire level of indirection between the
start and end of the punch range and the end of the punch range is the
first block of its level, we can't return early; we have to free the
intervening levels.
2. When the end is at a higher level of indirection than the start and
ext4_find_shared returns a top branch for the end, we still need to free
the rest of the shared branch it returns; we can't decrement partial2.
3. When a punch happens within one level of indirection, we need to
converge on an indirect block that contains the start and end. However,
because the branches returned from ext4_find_shared do not necessarily
start at the same level (e.g., the partial2 chain will be shallower if
the last block occurs at the beginning of an indirect group), the walk
of the two chains can end up "missing" each other and freeing a bunch of
extra blocks in the process. This mismatch can be handled by first
making sure that the chains are at the same level, then walking them
together until they converge.
4. When the punch happens within one level of indirection and
ext4_find_shared returns a top branch for the start, we must free it,
but only if the end does not occur within that branch.
5. When the punch happens within one level of indirection and
ext4_find_shared returns a top branch for the end, then we shouldn't
free the block referenced by the end of the returned chain (this mirrors
the different levels case).
Signed-off-by: Omar Sandoval <osandov@osandov.com>
As of v3.18, ext4 started rejecting a remount which changes the
journal_checksum option.
Prior to that, it was simply ignored; the problem here is that
if someone has this in their fstab for the root fs, now the box
fails to boot properly, because remount of root with the new options
will fail, and the box proceeds with a readonly root.
I think it is a little nicer behavior to accept the option, but
warn that it's being ignored, rather than failing the mount,
but that might be a subjective matter...
Reported-by: Cónräd <conradsand.arma@gmail.com>
Signed-off-by: Eric Sandeen <sandeen@redhat.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
rejection of, changing journal_checksum during remount. One suffices.
While we're at it, remove old comment about the "check" option
which has been deprecated for some time now.
Signed-off-by: Eric Sandeen <sandeen@redhat.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>