commit 2040fce83fe17763b07c97c1f691da2bb85e4135 upstream.
Previous mkfs.f2fs allows small partition inappropriately, so f2fs should detect
that as well.
Refer this in f2fs-tools.
mkfs.f2fs: detect small partition by overprovision ratio and # of segments
Reported-by: Eric Biggers <ebiggers@google.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
commit f455c8a5f0a24090e99249eb7280012376adec2c upstream.
The sync_fs in f2fs_balance_fs_bg must avoid interrupting current user requests.
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
commit 204706c7accfabb67b97eef9f9a28361b6201199 upstream.
This reverts commit 1beba1b3a953107c3ff5448ab4e4297db4619c76.
The perpcu_counter doesn't provide atomicity in single core and consume more
DRAM. That incurs fs_mark test failure due to ENOMEM.
Cc: stable@vger.kernel.org # 4.7+
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
commit 0002b61bdaac732bcff364a18f5bd57c95def0a5 upstream.
We should use AOP_WRITEPAGE_ACTIVATE when we bypass writing pages.
Signed-off-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Miao Xie <miaoxie@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
commit 26787236b36660baf4d136281d40b5bb33a570ec upstream.
If a file needs to keep its i_size by fallocate, we need to turn off auto
recovery during roll-forward recovery.
This will resolve the below scenario.
1. xfs_io -f /mnt/f2fs/file -c "pwrite 0 4096" -c "fsync"
2. xfs_io -f /mnt/f2fs/file -c "falloc -k 4096 4096" -c "fsync"
3. md5sum /mnt/f2fs/file;
4. godown /mnt/f2fs/
5. umount /mnt/f2fs/
6. mount -t f2fs /dev/sdx /mnt/f2fs
7. md5sum /mnt/f2fs/file
Reported-by: Chao Yu <chao@kernel.org>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
commit 19c526515f6b998039d5d71fea879d255f173746 upstream.
The addition of multiple-device support broke CONFIG_BLK_DEV_ZONED
on 32-bit machines because of a 64-bit division:
fs/f2fs/f2fs.o: In function `__issue_discard_async':
extent_cache.c:(.text.__issue_discard_async+0xd4): undefined reference to `__aeabi_uldivmod'
Fortunately, bdev_zone_size() is guaranteed to return a power-of-two
number, so we can replace the % operator with a cheaper bit mask.
Fixes: 792b84b74b54 ("f2fs: support multiple devices")
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
commit b08b12d2ddc85b977a0531470cf6a7158289aaaf upstream.
While calculating inode count that we can create at most in the left space,
we should consider space which data/node blocks occupied, since we create
data/node mixly in main area. So fix the wrong calculation in ->statfs.
Signed-off-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
commit 97dd26ad834739d4e4ea35fd7ab5f92824de4cbb upstream.
If i_size is not aligned to the f2fs's block size, we should not skip inode
update during fsync.
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
commit 3a3a5ead7b6d2c9a29f493791ba23f264052db34 upstream.
If i_size is already valid during roll_forward recovery, we should not update
it according to the block alignment.
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
commit 281518c694a5228d6c46fac83529fb3e2c331281 upstream.
For below two cases, we can't guarantee data consistence:
a)
1. xfs_io "pwrite 0 4195328" "fsync"
2. xfs_io "pwrite 4195328 1024" "fdatasync"
3. godown
4. umount & mount
--> isize we updated before fdatasync won't be recovered
b)
1. xfs_io "pwrite -S 0xcc 0 4202496" "fsync"
2. xfs_io "fpunch 4194304 4096" "fdatasync"
3. godown
4. umount & mount
--> dnode we punched before fdatasync won't be recovered
The reason is that normally fdatasync won't be aware of modification
of metadata in file, e.g. isize changing, dnode updating, so in ->fsync
we will skip flushing node pages for above cases, result in making
fdatasynced file being lost during recovery.
Currently we have introduced DIRTY_META global list in sbi for tracking
dirty inode selectively, so in fdatasync we can choose to flush nodes
depend on dirty state of current inode in the list.
Signed-off-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
commit 04d47e673863c637a2b44ad34a558aeb5d0a727e upstream.
Thread A Thread B Thread C
- f2fs_create
- f2fs_new_inode
- f2fs_lock_op
- alloc_nid
alloc last nid
- f2fs_unlock_op
- f2fs_create
- f2fs_new_inode
- f2fs_lock_op
- alloc_nid
as node count still not
be increased, we will
loop in alloc_nid
- f2fs_write_node_pages
- f2fs_balance_fs_bg
- f2fs_sync_fs
- write_checkpoint
- block_operations
- f2fs_lock_all
- f2fs_lock_op
While creating new inode, we do not allocate and account nid atomically,
so that when there is almost no free nids left, we may encounter deadloop
like above stack.
In order to avoid that, reuse nm_i::available_nids for accounting free nids
and make nid allocation and counting being atomical during node creation.
Signed-off-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
commit d40a43af0a57a017eba9ad2679183791587ceb6a upstream.
Thread A Thread B
- write_checkpoint
- block_operations
-blk_start_plug
-sync_node_pages - f2fs_do_sync_file
- fsync_node_pages
- f2fs_wait_on_page_writeback
Thread A wait for global F2FS_DIRTY_NODES decreased to zero,
it start a plug list, some requests have been added to this list.
Thread B lock one dirty node page, and wait this page write back.
But this page has been in plug list of thread A with PG_writeback flag.
Thread A keep on running and its plug list has no chance to finish,
so it seems a deadlock between cp and fsync path.
This patch add a wait on page write back before set node page dirty
to avoid this problem.
Signed-off-by: Yunlei He <heyunlei@huawei.com>
Signed-off-by: Pengyang Hou <houpengyang@huawei.com>
Reviewed-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
commit 36951b38d13ac7cce9fcf89e0e01c22ed0d05688 upstream.
Normally, while committing checkpoint, we will wait on all pages to be
writebacked no matter the page is data or metadata, so in scenario where
there are lots of data IO being submitted with metadata, we may suffer
long latency for waiting writeback during checkpoint.
Indeed, we only care about persistence for pages with metadata, but not
pages with data, as file system consistent are only related to metadate,
so in order to avoid encountering long latency in above scenario, let's
recognize and reference metadata in submitted IOs, wait writeback only
for metadatas.
Signed-off-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
commit c79b7ff1d3c7710c23d8828a69d8cbc5597ad19f upstream.
Previously, written_valid_blocks was got by ckpt->valid_block_count. But if
the last checkpoint has some NEW_ADDR due to power-cut, we can get wrong value.
Fix it to get the number from actual written block count from sit entries.
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
commit 7702bdbe505a22380dd958e2ee35124c7c414806 upstream.
If many threads hit has_not_enough_free_secs() in f2fs_balance_fs() at the same
time, all the threads would do FG_GC or BG_GC.
In this critical path, we totally don't need to do BG_GC at all.
Let's avoid that.
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
commit c040ff9d69fd1d782fe577ba9e35c1f5798158ae upstream.
In direct_IO path of f2fs_file_write_iter(),
1. f2fs_preallocate_blocks(F2FS_GET_BLOCK_PRE_DIO)
-> allocate LBA X
2. f2fs_direct_IO()
-> return 0;
Then,
f2fs_write_data_page() will allocate another LBA X+1.
This makes EIO triggered by HM-SMR.
Reviewed-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
commit 3c62be17d4f562f43fe1d03b48194399caa35aa5 upstream.
This patch implements multiple devices support for f2fs.
Given multiple devices by mkfs.f2fs, f2fs shows them entirely as one big
volume under one f2fs instance.
Internal block management is very simple, but we will modify block allocation
and background GC policy to boost IO speed by exploiting them accoording to
each device speed.
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
commit e57e9ae5b179a6b243c42bf6d9549d1595c27089 upstream.
We can allow dio reads for LFS mode, while doing buffered writes for dio writes.
Reviewed-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
commit 6ae1be13e85f4c42c8ca371fda50ae39eebbfd96 upstream.
Now we don't need to be too much careful about storage alignment for dio, since
its speed becomes quite fast and we'd better avoid any misalignment first.
Revert: 38aa0889b2 (f2fs: align direct_io'ed data to section)
Reviewed-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
commit 206147112860418fa9363cf94ec2c172bdf4313a upstream.
If one block has been to written to a new place, just return
in move data process. This patch check it again with holding
page lock.
Signed-off-by: Yunlei He <heyunlei@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
commit d47b8715953ad0cf82bb0a9d30d7b11d83bc9c11 upstream.
i_times of inode will be set with current system time which can be
configured through 'date', so it's not safe to judge dnode block as
garbage data or unchanged inode depend on i_times.
Now, we have used enhanced 'cp_ver + cp' crc method to verify valid
dnode block, so I expect recoverying invalid dnode is almost not
possible.
This reverts commit 807b1e1c8e08452948495b1a9985ab46d329e5c2.
Signed-off-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
commit bdb7d964c4fb73078ab21c20e8c7b7bf7e641bb6 upstream.
Previously, we assigned CURSEG_WARM_DATA for direct_io, but if we have two or
four logs, we do not use that type at all.
Let's fix it.
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
commit 9f0552e078b8a25fcf9c3950a05ea1398616d2b9 upstream.
Shouldn't update in-memory i_atime with on-disk i_mtime of inode when
recovering inode.
Shuoran found this bug which is hidden for a long time, honour is belong
to him.
Signed-off-by: Shuoran Liu <liushuoran@huawei.com>
Signed-off-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
commit 60dcedc9972d136fab1600c919dd32080bcc26f7 upstream.
We should record updating status of inode only for living inode, for those
unlinked inode it needs to clear its ino cache, otherwise after the ino
was been reused, it will cause unneeded node page writing during ->fsync.
Signed-off-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
commit 126606c7a99b32ba8265a51fab01533fe40c9ecc upstream.
Similarly to the regular discard, trace zone reset events.
Signed-off-by: Damien Le Moal <damien.lemoal@wdc.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
commit f46e8809e88d113ba096bcfc772b5182ce00b941 upstream.
When a zoned block device is mounted, discarding sections
contained in sequential zones must reset the zone write pointer.
For sections contained in conventional zones, the regular discard
is used if the drive supports it.
Signed-off-by: Damien Le Moal <damien.lemoal@wdc.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
commit 178053e2f1f9ccdb61ff6c2bd8644b53fc98e72e upstream.
With the zoned block device feature enabled, section discard
need to do a zone reset for sections contained in sequential
zones, and a regular discard (if supported) for sections
stored in conventional zones. Avoid the need for a costly
report zones to obtain a section zone type when discarding it
by caching the types of the device zones in the super block
information. This cache is initialized at mount time for mounts
with the zoned block device feature enabled.
Signed-off-by: Damien Le Moal <damien.lemoal@wdc.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
commit 3adc57e97792e4ac9f228bde802829e2e9840afe upstream.
The LFS mode is mandatory for host-managed zoned block devices as
update in place optimizations are not possible for segments in
sequential zones.
Signed-off-by: Damien Le Moal <damien.lemoal@wdc.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
commit 96ba2decb4241aa2c6b61cfc8489d648769eff99 upstream.
Zone write pointer reset acts as discard for zoned block
devices. So if the zoned block device feature is enabled,
always declare that discard is enabled, even if the device
does not actually support the command.
For the same reason, prevent the use the "nodicard" mount
option.
Signed-off-by: Damien Le Moal <damien.lemoal@wdc.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
commit 0ab0299835738cd407569401da1fef4c97b4419c upstream.
For zoned block devices, discard is replaced by zone reset. So
do not warn if the device does not supports discard.
Signed-off-by: Damien Le Moal <damien.lemoal@wdc.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
commit d1b959c8770260b611b9a1f0c5e8b12b7cb5b9d2 upstream.
The F2FS_FEATURE_BLKZONED feature indicates that the drive was formatted
with zone alignment optimization. This is optional for host-aware
devices, but mandatory for host-managed zoned block devices.
So check that the feature is set in this latter case.
Signed-off-by: Damien Le Moal <damien.lemoal@wdc.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
commit 0bfd7a091c19132489a0f977b8dbf9f6b5ae0a1c upstream.
SMR stands for "Shingled Magnetic Recording" which makes sense
only for hard disk drives (spinning rust). The ZBC/ZAC standards
enable management of SMR disks, but solid state drives may also
support those standards. So rename the HMSMR feature to BLKZONED
to avoid a HDD centric terminology. For the same reason, rename
f2fs_sb_mounted_hmsmr to f2fs_sb_mounted_blkzoned.
Signed-off-by: Damien Le Moal <damien.lemoal@wdc.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
commit 487df616dec33231c99294b906d720d256a2de16 upstream.
Signed-off-by: Damien Le Moal <damien.lemoal@wdc.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
commit ed6bd4b146527e7c6934e3582c47d7b857802676 upstream.
Report error of f2fs_fill_dentries to ->iterate_shared, otherwise when
error ocurrs, user may just list part of dirents in target directory
without any hints.
Signed-off-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
commit d117b9acaeada0b243f31e0fe83e111fcc9a6644 upstream.
Merge tag 'ext4_for_stable' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4
Pull ext4 fixes from Ted Ts'o:
"A security fix (so a maliciously corrupted file system image won't
panic the kernel) and some fixes for CONFIG_VMAP_STACK"
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
commit 230436b3ef3fd7d4a1da19edf5e87bb2d74e0fc2 upstream.
gcc is unsure about the use of last_ofs_in_node, which might happen
without a prior initialization:
fs/f2fs//git/arm-soc/fs/f2fs/data.c: In function ‘f2fs_map_blocks’:
fs/f2fs/data.c:799:54: warning: ‘last_ofs_in_node’ may be used uninitialized in this function [-Wmaybe-uninitialized]
if (prealloc && dn.ofs_in_node != last_ofs_in_node + 1) {
As pointed out by Chao Yu, the code is actually correct as 'prealloc'
is only set if the last_ofs_in_node has been set, the two always
get updated together.
This initializes last_ofs_in_node to dn.ofs_in_node for each
new dnode at the start of the 'next_block' loop, which at that
point is a correct initialization as well. I assume that compilers
that correctly track the contents of the variables and do not
warn about the condition also figure out that they can eliminate
the extra assignment here.
Fixes: 46008c6d4232 ("f2fs: support in batch multi blocks preallocation")
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
commit 35782b233f37e48ecc469d9c7232f3f6a7fad41a upstream.
This patch removes percpu_count usage due to performance regression in iozone.
Fixes: 523be8a6b3 ("f2fs: use percpu_counter for page counters")
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
commit 18340edc8da20b0d399eb25ba4bb631b27652f46 upstream.
This patch tries to make more clean inodes when flushing dirty inodes in
checkpoint.
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
commit 7c45729a4d6d1c90879e6c5c2df325c2f6db7191 upstream.
This is to avoid no free segment bug during checkpoint caused by a number of
dirty inodes.
The case was reported by Chao like this.
1. mount with lazytime option
2. fill 4k file until disk is full
3. sync filesystem
4. read all files in the image
5. umount
In this case, we actually don't need to flush dirty inode to inode page during
checkpoint.
Reviewed-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
commit 02027d42c3f747945f19111d3da2092ed2148ac8 upstream.
This is for backport only.
fs: Replace CURRENT_TIME_SEC with current_time() for inode timestamps
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
commit 664ba972df9b96942191db3068274cc1db899774 upstream.
We don't need to allocate bio partially in order to maximize sequential writes.
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
commit 15d04354555fdfe8005e1365009e349148fb5f90 upstream.
If inode becomes dirty, we need to check the # of dirty inodes whether or not
further checkpoint would be required.
Reviewed-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
commit b9610bdfcbdbb6017802ec6d1e073f445c98157d upstream.
If there are a lot of dirty inodes, we need to flush all of them when doing
checkpoint. So, we need to count this for enough free space.
Reviewed-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
commit 02110a4fd53164db7cce3bb2780dce4d6c4e058f upstream.
This patch makes sure it returns a positive value instead of a probable
casted negative value as shrink count.
Signed-off-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
commit 3a2ad5672bb36ee9c07bab97dadc8b0f70d391f4 upstream.
Let build_free_nids support sync/async methods, in allocation flow of nids,
we use synchronuous method, so that we can avoid looping in alloc_nid when
free memory is low; in unblock_operations and f2fs_balance_fs_bg we use
asynchronuous method in where low memory condition can interrupt us.
Signed-off-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>