If fault injection functionality is enabled, show additional injection
rate in ->show_options.
Signed-off-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
err must be set to -ENOMEM, otherwise we return 0.
Fixes: a912b54d3aaa0 ("f2fs: split bio cache")
Signed-off-by: Christophe JAILLET <christophe.jaillet@wanadoo.fr>
Reviewed-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
It is better to use variable name "inline_dentry" instead of "dentry_blk"
when data type is "struct f2fs_inline_dentry". This patch has no functional
changes, just to make code more readable especially when call the function
make_dentry_ptr_inline() and f2fs_convert_inline_dir().
Signed-off-by: Tiezhu Yang <kernelpatch@126.com>
Reviewed-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
With fault_injection option, generic/361 of fstests will complain us
with below message:
Call Trace:
get_node_page+0x12/0x20 [f2fs]
f2fs_iget+0x92/0x7d0 [f2fs]
f2fs_fill_super+0x10fb/0x15e0 [f2fs]
mount_bdev+0x184/0x1c0
f2fs_mount+0x15/0x20 [f2fs]
mount_fs+0x39/0x150
vfs_kern_mount+0x67/0x110
do_mount+0x1bb/0xc70
SyS_mount+0x83/0xd0
do_syscall_64+0x6e/0x160
entry_SYSCALL64_slow_path+0x25/0x25
Since mkfs loop device in f2fs partition can be failed silently due to
checkpoint error injection, so root inode page can be corrupted, in order
to avoid needless panic, in get_node_page, it's better to leave message
and return error to caller, and let fsck repaire it later.
Signed-off-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
We will never persist newly allocated nat entries during checkpoint(), so
we don't need to track such nat entries in nat dirty list in order to
avoid:
- more latency during traversing dirty list;
- sorting nat sets incorrectly due to recording wrong entry_cnt in nat
entry set.
Signed-off-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
Recently, discard related codes have changed a lot, so add f2fs_bug_on to
detect potential bug.
Signed-off-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
Currently in F2FS, page faults and operations that truncate the pagecahe
or data blocks, are completely unsynchronized. This can result in page
fault faulting in a page into a range that we are changing after
truncating, and thus we can end up with a page mapped to disk blocks that
will be shortly freed. Filesystem corruption will shortly follow.
This patch fixes the problem by creating new rw semaphore i_mmap_sem in
f2fs_inode_info and grab it for functions removing blocks from extent tree
and for read over page faults. The mechanism is similar to that in ext4.
Signed-off-by: Qiuyang Sun <sunqiuyang@huawei.com>
Reviewed-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
Conflicts:
fs/f2fs/file.c
The index of segment which the next nat block is in has only one different
bit than the current one, so to get the next nat address, we can simply
alter that one bit.
Signed-off-by: Fan Li <fanofcode.li@samsung.com>
Reviewed-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
Make sure number of entires doesn't exceed max journal size.
Cc: stable@vger.kernel.org
Signed-off-by: Jin Qian <jinqian@android.com>
Reviewed-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
Mount fs with option noflush_merge, boot failed for illegal address
fcc in function f2fs_issue_flush:
if (!test_opt(sbi, FLUSH_MERGE)) {
ret = submit_flush_wait(sbi);
atomic_inc(&fcc->issued_flush); -> Here, fcc illegal
return ret;
}
Signed-off-by: Yunlei He <heyunlei@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
It's not necessary to specify 'int' casting for PTR_ERR.
Signed-off-by: Zhang Shengju <zhangshengju@cmss.chinamobile.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
For example,
f2fs_create
- new_node_page is failed
- handle_failed_inode
- skip to add it into orphan list, since ni.blk_addr == NULL_ADDR
: set_inode_flag(inode, FI_FREE_NID)
f2fs_evict_inode
- EIO due to fault injection
- f2fs_bug_on() is triggered
So, we don't need to call f2fs_bug_on in this case.
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
clear_prefree_segments() issues small discards after discarding full
segments. These small discards may not be section aligned, so not zone
aligned on a zoned block device, causing __f2fs_iissue_discard_zone() to fail.
Fix this by not issuing small discards for a volume mounted with the BLKZONED
feature enabled.
Cc: stable@vger.kernel.org
Signed-off-by: Damien Le Moal <damien.lemoal@wdc.com>
Reviewed-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
Since only an open file can be written to, and we only allow open()ing
an encrypted file when its key is available, there is no need to check
for the key again before permitting each ->write_iter().
This code was also broken in that it wouldn't actually have failed if
the key was in fact unavailable.
Signed-off-by: Eric Biggers <ebiggers@google.com>
Reviewed-by: David Gstir <david@sigma-star.at>
Acked-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
Since only an open file can be mmap'ed, and we only allow open()ing an
encrypted file when its key is available, there is no need to check for
the key again before permitting each mmap().
This f2fs copy of this code was also broken in that it wouldn't actually
have failed if the key was in fact unavailable.
Signed-off-by: Eric Biggers <ebiggers@google.com>
Reviewed-by: David Gstir <david@sigma-star.at>
Acked-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
Wait discard IO completion outside cmd_lock to avoid long latency
of holding cmd_lock in IO busy scenario.
Signed-off-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
There could be more than one waiter waiting discard IO completion, so we
need use complete_all() instead of complete() in f2fs_submit_discard_endio
to avoid hungtask.
Fixes: ec9895add2c5 ("f2fs: don't hold cmd_lock during waiting discard
command")
Cc: <stable@vger.kernel.org>
Signed-off-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
Serialize data/node IOs by using fifo list instead of mutex lock,
it will help to enhance concurrency of f2fs, meanwhile keeping LFS
IO semantics.
Signed-off-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
Came across a hungtask caused by huge number of rb tree traversing
during adding discard addrs in cp. This patch combine these consistence
checks and move it to discard thread.
Signed-off-by: Yunlei He <heyunlei@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
This patch add a new method get_ssr_cost to select
SSR segment more accurately.
Signed-off-by: Yunlei He <heyunlei@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
Currently, if we do get_node_of_data before f2fs_lock_op, there may be dead lock
as follows, where process A would be in infinite loop, and B will NOT be awaked.
Process A(cp): Process B:
f2fs_lock_all(sbi)
get_dnode_of_data <---- lock dn.node_page
flush_nodes f2fs_lock_op
So, this patch adds f2fs_trylock_op to avoid f2fs_lock_op done by IPU.
Signed-off-by: Hou Pengyang <houpengyang@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
Split DATA/NODE type bio cache according to different temperature,
so write IOs with the same temperature can be merged in corresponding
bio cache as much as possible, otherwise, different temperature write
IOs submitting into one bio cache will always cause split of bio.
Signed-off-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
Conflicts:
include/trace/events/f2fs.h
Merged IO flow doesn't need to care about read IOs.
f2fs_submit_merged_bio -> f2fs_submit_merged_write
f2fs_submit_merged_bios -> f2fs_submit_merged_writes
f2fs_submit_merged_bio_cond -> f2fs_submit_merged_write_cond
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
By default, f2fs_gc returns -EINVAL in general error cases, e.g., no victim
was selected. However, the default errno may be overwritten in two cases:
gc_more and BG_GC -> FG_GC. We should return consistent errno in such cases.
Signed-off-by: Weichao Guo <guoweichao@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
Make sure segno and blkoff read from raw image are valid.
Cc: stable@vger.kernel.org
Signed-off-by: Jin Qian <jinqian@google.com>
[Jaegeuk Kim: adjust minor coding style]
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
As I noted on the mailing list, it's easier than I originally thought to
create intentional collisions in the digested names. Unfortunately it's
not too easy to solve this, so for now just fix the comment to not lie.
Signed-off-by: Eric Biggers <ebiggers@google.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Switch f2fs directory searches to use the fscrypt_match_name() helper
function. There should be no functional change.
Signed-off-by: Eric Biggers <ebiggers@google.com>
Acked-by: Jaegeuk Kim <jaegeuk@kernel.org>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Introduce a helper function fscrypt_match_name() which tests whether a
fscrypt_name matches a directory entry. Also clean up the magic numbers
and document things properly.
Signed-off-by: Eric Biggers <ebiggers@google.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
To mitigate some types of offline attacks, filesystem encryption is
designed to enforce that all files in an encrypted directory tree use
the same encryption policy (i.e. the same encryption context excluding
the nonce). However, the fscrypt_has_permitted_context() function which
enforces this relies on comparing struct fscrypt_info's, which are only
available when we have the encryption keys. This can cause two
incorrect behaviors:
1. If we have the parent directory's key but not the child's key, or
vice versa, then fscrypt_has_permitted_context() returned false,
causing applications to see EPERM or ENOKEY. This is incorrect if
the encryption contexts are in fact consistent. Although we'd
normally have either both keys or neither key in that case since the
master_key_descriptors would be the same, this is not guaranteed
because keys can be added or removed from keyrings at any time.
2. If we have neither the parent's key nor the child's key, then
fscrypt_has_permitted_context() returned true, causing applications
to see no error (or else an error for some other reason). This is
incorrect if the encryption contexts are in fact inconsistent, since
in that case we should deny access.
To fix this, retrieve and compare the fscrypt_contexts if we are unable
to set up both fscrypt_infos.
While this slightly hurts performance when accessing an encrypted
directory tree without the key, this isn't a case we really need to be
optimizing for; access *with* the key is much more important.
Furthermore, the performance hit is barely noticeable given that we are
already retrieving the fscrypt_context and doing two keyring searches in
fscrypt_get_encryption_info(). If we ever actually wanted to optimize
this case we might start by caching the fscrypt_contexts.
Cc: stable@vger.kernel.org # 4.0+
Signed-off-by: Eric Biggers <ebiggers@google.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
This commit exposes the necessary constants and structures for a
userspace program to pass filesystem encryption keys into the keyring.
The fscrypt_key structure was already part of the kernel ABI, this
change just makes it so programs no longer have to redeclare these
structures (like e4crypt in e2fsprogs currently does).
Note that we do not expose the other FS_*_KEY_SIZE constants as they are
not necessary. Only XTS is supported for contents_encryption_mode, so
currently FS_MAX_KEY_SIZE bytes of key material must always be passed to
the kernel.
This commit also removes __packed from fscrypt_key as it does not
contain any implicit padding and does not refer to an on-disk structure.
Signed-off-by: Joe Richey <joerichey@google.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
fscrypt_symlink_data_len() is never called and can be removed.
Signed-off-by: Eric Biggers <ebiggers@google.com>
Reviewed-by: Richard Weinberger <richard@nod.at>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
The functions in fs/crypto/*.c are only called by filesystems configured
with encryption support. Since the ->get_context(), ->set_context(),
and ->empty_dir() operations are always provided in that case (and must
be, otherwise there would be no way to get/set encryption policies, or
in the case of ->get_context() even access encrypted files at all),
there is no need to check for these operations being NULL and we can
remove these unneeded checks.
Signed-off-by: Eric Biggers <ebiggers@google.com>
Reviewed-by: Richard Weinberger <richard@nod.at>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
The only use of the ->prepare_context() fscrypt operation was to allow
ext4 to evict inline data from the inode before ->set_context().
However, there is no reason why this cannot be done as simply the first
step in ->set_context(), and in fact it makes more sense to do it that
way because then the policy modes and flags get validated before any
real work is done. Therefore, merge ext4_prepare_context() into
ext4_set_context(), and remove ->prepare_context().
Signed-off-by: Eric Biggers <ebiggers@google.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Conflicts:
fs/ext4/super.c
Filesystem encryption ostensibly supported revoking a keyring key that
had been used to "unlock" encrypted files, causing those files to become
"locked" again. This was, however, buggy for several reasons, the most
severe of which was that when key revocation happened to be detected for
an inode, its fscrypt_info was immediately freed, even while other
threads could be using it for encryption or decryption concurrently.
This could be exploited to crash the kernel or worse.
This patch fixes the use-after-free by removing the code which detects
the keyring key having been revoked, invalidated, or expired. Instead,
an encrypted inode that is "unlocked" now simply remains unlocked until
it is evicted from memory. Note that this is no worse than the case for
block device-level encryption, e.g. dm-crypt, and it still remains
possible for a privileged user to evict unused pages, inodes, and
dentries by running 'sync; echo 3 > /proc/sys/vm/drop_caches', or by
simply unmounting the filesystem. In fact, one of those actions was
already needed anyway for key revocation to work even somewhat sanely.
This change is not expected to break any applications.
In the future I'd like to implement a real API for fscrypt key
revocation that interacts sanely with ongoing filesystem operations ---
waiting for existing operations to complete and blocking new operations,
and invalidating and sanitizing key material and plaintext from the VFS
caches. But this is a hard problem, and for now this bug must be fixed.
This bug affected almost all versions of ext4, f2fs, and ubifs
encryption, and it was potentially reachable in any kernel configured
with encryption support (CONFIG_EXT4_ENCRYPTION=y,
CONFIG_EXT4_FS_ENCRYPTION=y, CONFIG_F2FS_FS_ENCRYPTION=y, or
CONFIG_UBIFS_FS_ENCRYPTION=y). Note that older kernels did not use the
shared fs/crypto/ code, but due to the potential security implications
of this bug, it may still be worthwhile to backport this fix to them.
Fixes: b7236e21d5 ("ext4 crypto: reorganize how we store keys in the inode")
Cc: stable@vger.kernel.org # v4.2+
Signed-off-by: Eric Biggers <ebiggers@google.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Acked-by: Michael Halcrow <mhalcrow@google.com>
When accessing an encrypted directory without the key, userspace must
operate on filenames derived from the ciphertext names, which contain
arbitrary bytes. Since we must support filenames as long as NAME_MAX,
we can't always just base64-encode the ciphertext, since that may make
it too long. Currently, this is solved by presenting long names in an
abbreviated form containing any needed filesystem-specific hashes (e.g.
to identify a directory block), then the last 16 bytes of ciphertext.
This needs to be sufficient to identify the actual name on lookup.
However, there is a bug. It seems to have been assumed that due to the
use of a CBC (ciphertext block chaining)-based encryption mode, the last
16 bytes (i.e. the AES block size) of ciphertext would depend on the
full plaintext, preventing collisions. However, we actually use CBC
with ciphertext stealing (CTS), which handles the last two blocks
specially, causing them to appear "flipped". Thus, it's actually the
second-to-last block which depends on the full plaintext.
This caused long filenames that differ only near the end of their
plaintexts to, when observed without the key, point to the wrong inode
and be undeletable. For example, with ext4:
# echo pass | e4crypt add_key -p 16 edir/
# seq -f "edir/abcdefghijklmnopqrstuvwxyz012345%.0f" 100000 | xargs touch
# find edir/ -type f | xargs stat -c %i | sort | uniq | wc -l
100000
# sync
# echo 3 > /proc/sys/vm/drop_caches
# keyctl new_session
# find edir/ -type f | xargs stat -c %i | sort | uniq | wc -l
2004
# rm -rf edir/
rm: cannot remove 'edir/_A7nNFi3rhkEQlJ6P,hdzluhODKOeWx5V': Structure needs cleaning
...
To fix this, when presenting long encrypted filenames, encode the
second-to-last block of ciphertext rather than the last 16 bytes.
Although it would be nice to solve this without depending on a specific
encryption mode, that would mean doing a cryptographic hash like SHA-256
which would be much less efficient. This way is sufficient for now, and
it's still compatible with encryption modes like HEH which are strong
pseudorandom permutations. Also, changing the presented names is still
allowed at any time because they are only provided to allow applications
to do things like delete encrypted directories. They're not designed to
be used to persistently identify files --- which would be hard to do
anyway, given that they're encrypted after all.
For ease of backports, this patch only makes the minimal fix to both
ext4 and f2fs. It leaves ubifs as-is, since ubifs doesn't compare the
ciphertext block yet. Follow-on patches will clean things up properly
and make the filesystems use a shared helper function.
Fixes: 5de0b4d0cd ("ext4 crypto: simplify and speed up filename encryption")
Reported-by: Gwendal Grignou <gwendal@chromium.org>
Cc: stable@vger.kernel.org
Signed-off-by: Eric Biggers <ebiggers@google.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
If user has no key under an encrypted dir, fscrypt gives digested dentries.
Previously, when looking up a dentry, f2fs only checks its hash value with
first 4 bytes of the digested dentry, which didn't handle hash collisions fully.
This patch enhances to check entire dentry bytes likewise ext4.
Eric reported how to reproduce this issue by:
# seq -f "edir/abcdefghijklmnopqrstuvwxyz012345%.0f" 100000 | xargs touch
# find edir -type f | xargs stat -c %i | sort | uniq | wc -l
100000
# sync
# echo 3 > /proc/sys/vm/drop_caches
# keyctl new_session
# find edir -type f | xargs stat -c %i | sort | uniq | wc -l
99999
Cc: <stable@vger.kernel.org>
Reported-by: Eric Biggers <ebiggers@google.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
(fixed f2fs_dentry_hash() to work even when the hash is 0)
Signed-off-by: Eric Biggers <ebiggers@google.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Conflicts:
fs/f2fs/inline.c
As for ext4, now that fscrypt_has_permitted_context() correctly handles
the case where we have the key for the parent directory but not the
child, f2fs_lookup() no longer has to work around it. Also add the same
warning message that ext4 uses.
Signed-off-by: Eric Biggers <ebiggers@google.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
-write_checkpoint
-do_checkpoint
-next_free_nid <--- something wrong with next free nid
-f2fs_fill_super
-build_node_manager
-build_free_nids
-get_current_nat_page
-__get_meta_page <--- attempt to access beyond end of device
Signed-off-by: Yunlei He <heyunlei@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
This patch expands cover region of inode->i_rwsem to keep setting flag
atomically.
Signed-off-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>