If we have two tasks racing to update a mirror's credentials, then they
can end up leaking one (or more) sets of credentials. The first task
will set mirror->cred and then the second task will just overwrite it.
Use a cmpxchg to ensure that the creds are only set once. If we get to
the point where we would set mirror->cred and find that they're already
set, then we just release the creds that were just found.
Signed-off-by: Jeff Layton <jeff.layton@primarydata.com>
Cc: stable@vger.kernel.org # 4.0+
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
Pull cgroup writeback support from Jens Axboe:
"This is the big pull request for adding cgroup writeback support.
This code has been in development for a long time, and it has been
simmering in for-next for a good chunk of this cycle too. This is one
of those problems that has been talked about for at least half a
decade, finally there's a solution and code to go with it.
Also see last weeks writeup on LWN:
http://lwn.net/Articles/648292/"
* 'for-4.2/writeback' of git://git.kernel.dk/linux-block: (85 commits)
writeback, blkio: add documentation for cgroup writeback support
vfs, writeback: replace FS_CGROUP_WRITEBACK with SB_I_CGROUPWB
writeback: do foreign inode detection iff cgroup writeback is enabled
v9fs: fix error handling in v9fs_session_init()
bdi: fix wrong error return value in cgwb_create()
buffer: remove unusued 'ret' variable
writeback: disassociate inodes from dying bdi_writebacks
writeback: implement foreign cgroup inode bdi_writeback switching
writeback: add lockdep annotation to inode_to_wb()
writeback: use unlocked_inode_to_wb transaction in inode_congested()
writeback: implement unlocked_inode_to_wb transaction and use it for stat updates
writeback: implement [locked_]inode_to_wb_and_lock_list()
writeback: implement foreign cgroup inode detection
writeback: make writeback_control track the inode being written back
writeback: relocate wb[_try]_get(), wb_put(), inode_{attach|detach}_wb()
mm: vmscan: disable memcg direct reclaim stalling if cgroup writeback support is in use
writeback: implement memcg writeback domain based throttling
writeback: reset wb_domain->dirty_limit[_tstmp] when memcg domain size changes
writeback: implement memcg wb_domain
writeback: update wb_over_bg_thresh() to use wb_domain aware operations
...
Pull core block IO update from Jens Axboe:
"Nothing really major in here, mostly a collection of smaller
optimizations and cleanups, mixed with various fixes. In more detail,
this contains:
- Addition of policy specific data to blkcg for block cgroups. From
Arianna Avanzini.
- Various cleanups around command types from Christoph.
- Cleanup of the suspend block I/O path from Christoph.
- Plugging updates from Shaohua and Jeff Moyer, for blk-mq.
- Eliminating atomic inc/dec of both remaining IO count and reference
count in a bio. From me.
- Fixes for SG gap and chunk size support for data-less (discards)
IO, so we can merge these better. From me.
- Small restructuring of blk-mq shared tag support, freeing drivers
from iterating hardware queues. From Keith Busch.
- A few cfq-iosched tweaks, from Tahsin Erdogan and me. Makes the
IOPS mode the default for non-rotational storage"
* 'for-4.2/core' of git://git.kernel.dk/linux-block: (35 commits)
cfq-iosched: fix other locations where blkcg_to_cfqgd() can return NULL
cfq-iosched: fix sysfs oops when attempting to read unconfigured weights
cfq-iosched: move group scheduling functions under ifdef
cfq-iosched: fix the setting of IOPS mode on SSDs
blktrace: Add blktrace.c to BLOCK LAYER in MAINTAINERS file
block, cgroup: implement policy-specific per-blkcg data
block: Make CFQ default to IOPS mode on SSDs
block: add blk_set_queue_dying() to blkdev.h
blk-mq: Shared tag enhancements
block: don't honor chunk sizes for data-less IO
block: only honor SG gap prevention for merges that contain data
block: fix returnvar.cocci warnings
block, dm: don't copy bios for request clones
block: remove management of bi_remaining when restoring original bi_end_io
block: replace trylock with mutex_lock in blkdev_reread_part()
block: export blkdev_reread_part() and __blkdev_reread_part()
suspend: simplify block I/O handling
block: collapse bio bit space
block: remove unused BIO_RW_BLOCK and BIO_EOF flags
block: remove BIO_EOPNOTSUPP
...
* Minor fixes for UBI and UBIFS
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v2
iQIcBAABAgAGBQJVi9DTAAoJEPmeIjfg4UaaOlAP/AtZqcxpCHlKpbZUz7z1MmCb
MP9AbowP0CSXrMYSA7GrBROw+bfTM7aZrhWqhhoJxAyBbb9LGldjW9xZmUqUvS5e
LZjAO8sWqxEbfVBqnpTgA2AiSN37I3+epYqgDFNlNCtxPYxkEXpXVoLUSdrAUKw8
24EjPe0tq1JKLmjh3XJQZJnmCW+7xPKRUj7KiwPZzombIoNXFA9z4PCOLLlnDx1X
TRasArrncarslQaHQerRoZUKC/hoIGZMipv94pNovgO8HAcZ/N19ef1Fc4iG5zvS
XUMerUyKMTVfR6LFjueMjRpEMOmDb4AifWeDKY00xqY4RCnkIcAAqVRBGWQep6Y4
PxCwttNktykWj/LygPCsQk6gd0IxdVGsFra4DpxhCCWXaUw2RJCRg0a4mQFOqojO
tmCVQVYgd4Ob2DDnxmS7Z8IvjIWzsSkgatF+Dde2feeHUwEuM9NoudHg704+A30x
3fNwIFhuREhOymxGpBaprfaKh7khF88CcrGjiEOvw/DK0QfwlVoUVLqjkNVaLZXC
nrwNATdheq2gMV2Mdydy/uo5NpUxeblSrlnuknKQSTmH5m2/U9F9lJ/v3Sv7ShjB
gWqyNIoxJBKTSbsGkFP5bzWmxwsQ4HlPTcAmJhoUDK7SLeXThhgRwB4o2zlriYfS
D7ZVxRMY5JCNxomIzD0X
=xU0H
-----END PGP SIGNATURE-----
Merge tag 'upstream-4.2-rc1' of git://git.infradead.org/linux-ubifs
Pull UBI/UBIFS updates from Richard Weinberger:
"Minor fixes for UBI and UBIFS"
* tag 'upstream-4.2-rc1' of git://git.infradead.org/linux-ubifs:
UBI: Remove unnecessary `\'
UBI: Use static class and attribute groups
UBI: add a helper function for updatting on-flash layout volumes
UBI: Fastmap: Do not add vol if it already exists
UBI: Init vol->reserved_pebs by assignment
UBI: Fastmap: Rename variables to make them meaningful
UBI: Fastmap: Remove unnecessary `\'
UBI: Fastmap: Use max() to get the larger value
ubifs: fix to check error code of register_shrinker
UBI: block: Dynamically allocate minor numbers
the ext4 encryption patches, which is a new feature added in the last
merge window. Also fix a number of long-standing xfstest failures.
(Quota writes failing due to ENOSPC, a race between truncate and
writepage in data=journalled mode that was causing generic/068 to
fail, and other corner cases.)
Also add support for FALLOC_FL_INSERT_RANGE, and improve jbd2
performance eliminating locking when a buffer is modified more than
once during a transaction (which is very common for allocation
bitmaps, for example), in which case the state of the journalled
buffer head doesn't need to change.
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v2
iQEcBAABCAAGBQJVi3PeAAoJEPL5WVaVDYGj+I0H/jRPexvyvnGfxiqs1sxIlbSk
cwewFJSsuKsy/pGYdmHvozWZyWGGORc89NrxoNwdbG+axvHbgUWt/3+vF+rzmaek
vX4v9QvCEo4PfpRgzbnYJFhbxGMJtwci887sq1o/UoNXikFYT2kz8rpdf0++eO5W
/GJNRA5ZUY0L0eeloUILAMrBr7KjtkI2oXwOZt5q68jh7B3n3XdNQXyEiQS/28aK
QYcFrqA/e2Fiuk6l5OSGBCP38mySu+x0nBTLT5LFwwrUBnoZvGtdjM6Sj/yADDDn
uP/Zpq56aLzkFRwwItrDaF26BIf2MhIH/WUYs65CraEGxjMaiPuzAudGA/iUVL8=
=1BdR
-----END PGP SIGNATURE-----
Merge tag 'ext4_for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4
Pull ext4 updates from Ted Ts'o:
"A very large number of cleanups and bug fixes --- in particular for
the ext4 encryption patches, which is a new feature added in the last
merge window. Also fix a number of long-standing xfstest failures.
(Quota writes failing due to ENOSPC, a race between truncate and
writepage in data=journalled mode that was causing generic/068 to
fail, and other corner cases.)
Also add support for FALLOC_FL_INSERT_RANGE, and improve jbd2
performance eliminating locking when a buffer is modified more than
once during a transaction (which is very common for allocation
bitmaps, for example), in which case the state of the journalled
buffer head doesn't need to change"
[ I renamed "ext4_follow_link()" to "ext4_encrypted_follow_link()" in
the merge resolution, to make it clear that that function is _only_
used for encrypted symlinks. The function doesn't actually work for
non-encrypted symlinks at all, and they use the generic helpers
- Linus ]
* tag 'ext4_for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4: (52 commits)
ext4: set lazytime on remount if MS_LAZYTIME is set by mount
ext4: only call ext4_truncate when size <= isize
ext4: make online defrag error reporting consistent
ext4: minor cleanup of ext4_da_reserve_space()
ext4: don't retry file block mapping on bigalloc fs with non-extent file
ext4: prevent ext4_quota_write() from failing due to ENOSPC
ext4: call sync_blockdev() before invalidate_bdev() in put_super()
jbd2: speedup jbd2_journal_dirty_metadata()
jbd2: get rid of open coded allocation retry loop
ext4: improve warning directory handling messages
jbd2: fix ocfs2 corrupt when updating journal superblock fails
ext4: mballoc: avoid 20-argument function call
ext4: wait for existing dio workers in ext4_alloc_file_blocks()
ext4: recalculate journal credits as inode depth changes
jbd2: use GFP_NOFS in jbd2_cleanup_journal_tail()
ext4: use swap() in mext_page_double_lock()
ext4: use swap() in memswap()
ext4: fix race between truncate and __ext4_journalled_writepage()
ext4 crypto: fail the mount if blocksize != pagesize
ext4: Add support FALLOC_FL_INSERT_RANGE for fallocate
...
Before a page get locked, someone else can write data to the page
and increase the i_size. So we should re-check the i_size after
pages are locked.
Signed-off-by: Yan, Zheng <zyan@redhat.com>
Previously our dcache readdir code relies on that child dentries in
directory dentry's d_subdir list are sorted by dentry's offset in
descending order. When adding dentries to the dcache, if a dentry
already exists, our readdir code moves it to head of directory
dentry's d_subdir list. This design relies on dcache internals.
Al Viro suggests using ncpfs's approach: keeping array of pointers
to dentries in page cache of directory inode. the validity of those
pointers are presented by directory inode's complete and ordered
flags. When a dentry gets pruned, we clear directory inode's complete
flag in the d_prune() callback. Before moving a dentry to other
directory, we clear the ordered flag for both old and new directory.
Signed-off-by: Yan, Zheng <zyan@redhat.com>
GFP_NOFS memory allocation is required for page writeback path.
But there is no need to use GFP_NOFS in syscall path and readpage
path
Signed-off-by: Yan, Zheng <zyan@redhat.com>
if flushing caps were revoked, we should re-send the cap flush in
client reconnect stage. This guarantees that MDS processes the cap
flush message before issuing the flushing caps to other client.
Signed-off-by: Yan, Zheng <zyan@redhat.com>
According to this information, MDS can trim its completed caps flush
list (which is used to detect duplicated cap flush).
Signed-off-by: Yan, Zheng <zyan@redhat.com>
So we know TID of the oldest pending caps flushing. Later patch will
send this information to MDS, so that MDS can trim its completed caps
flush list.
Tracking pending caps flushing globally also simplifies syncfs code.
Signed-off-by: Yan, Zheng <zyan@redhat.com>
Previously we do not trace accurate TID for flushing caps. when
MDS failovers, we have no choice but to re-send all flushing caps
with a new TID. This can cause problem because MDS can has already
flushed some caps and has issued the same caps to other client.
The re-sent cap flush has a new TID, which makes MDS unable to
detect if it has already processed the cap flush.
This patch adds code to track pending caps flushing accurately.
When re-sending cap flush is needed, we use its original flush
TID.
Signed-off-by: Yan, Zheng <zyan@redhat.com>
fsync() on directory should flush dirty caps and wait for any
uncommitted directory opertions to commit. But ceph_dir_fsync()
only waits for uncommitted directory opertions.
Signed-off-by: Yan, Zheng <zyan@redhat.com>
Current ceph_fsync() only flushes dirty caps and wait for them to be
flushed. It doesn't wait for caps that has already been flushing.
This patch makes ceph_fsync() wait for pending flushing caps too.
Besides, this patch also makes caps_are_flushed() peroperly handle
tid wrapping.
Signed-off-by: Yan, Zheng <zyan@redhat.com>
when copying files to cephfs, file data may stay in page cache after
corresponding file is closed. Cached data use Fc capability. If we
include Fc capability in cap_wanted, MDS will treat files with cached
data as open files, and journal them in an EOpen event when trimming
log segment.
Signed-off-by: Yan, Zheng <zyan@redhat.com>
No need to bifurcate wait now that we've got ceph_timeout_jiffies().
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
Reviewed-by: Alex Elder <elder@linaro.org>
Reviewed-by: Yan, Zheng <zyan@redhat.com>
There are currently three libceph-level timeouts that the user can
specify on mount: mount_timeout, osd_idle_ttl and osdkeepalive. All of
these are in seconds and no checking is done on user input: negative
values are accepted, we multiply them all by HZ which may or may not
overflow, arbitrarily large jiffies then get added together, etc.
There is also a bug in the way mount_timeout=0 is handled. It's
supposed to mean "infinite timeout", but that's not how wait.h APIs
treat it and so __ceph_open_session() for example will busy loop
without much chance of being interrupted if none of ceph-mons are
there.
Fix all this by verifying user input, storing timeouts capped by
msecs_to_jiffies() in jiffies and using the new ceph_timeout_jiffies()
helper for all user-specified waits to handle infinite timeouts
correctly.
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
Reviewed-by: Alex Elder <elder@linaro.org>
Previously we pre-allocate cap release messages for each caps. This
wastes lots of memory when there are large amount of caps. This patch
make the code not pre-allocate the cap release messages. Instead,
we add the corresponding ceph_cap struct to a list when releasing a
cap. Later when flush cap releases is needed, we allocate the cap
release messages dynamically.
Signed-off-by: Yan, Zheng <zyan@redhat.com>
When ceph inode's i_head_snapc is NULL, __ceph_mark_dirty_caps()
accesses snap realm's cached_context. So we need take read lock
of snap_rwsem.
Signed-off-by: Yan, Zheng <zyan@redhat.com>
when a snap notification contains no new snapshot, we can avoid
sending FLUSHSNAP message to MDS. But we still need to create
cap_snap in some case because it's required by write path and
page writeback path
Signed-off-by: Yan, Zheng <zyan@redhat.com>
In most cases that snap context is needed, we are holding
reference of CEPH_CAP_FILE_WR. So we can set ceph inode's
i_head_snapc when getting the CEPH_CAP_FILE_WR reference,
and make codes get snap context from i_head_snapc. This makes
the code simpler.
Another benefit of this change is that we can handle snap
notification more elegantly. Especially when snap context
is updated while someone else is doing write. The old queue
cap_snap code may set cap_snap's context to ether the old
context or the new snap context, depending on if i_head_snapc
is set. The new queue capp_snap code always set cap_snap's
context to the old snap context.
Signed-off-by: Yan, Zheng <zyan@redhat.com>
Cached_context in ceph_snap_realm is directly accessed by
uninline_data() and get_pool_perm(). This is racy in theory.
both uninline_data() and get_pool_perm() do not modify existing
object, they only create new object. So we can pass the empty
snap context to them. Unlike cached_context in ceph_snap_realm,
we do not need to protect the empty snap context.
Signed-off-by: Yan, Zheng <zyan@redhat.com>
Merge first patchbomb from Andrew Morton:
- a few misc things
- ocfs2 udpates
- kernel/watchdog.c feature work (took ages to get right)
- most of MM. A few tricky bits are held up and probably won't make 4.2.
* emailed patches from Andrew Morton <akpm@linux-foundation.org>: (91 commits)
mm: kmemleak_alloc_percpu() should follow the gfp from per_alloc()
mm, thp: respect MPOL_PREFERRED policy with non-local node
tmpfs: truncate prealloc blocks past i_size
mm/memory hotplug: print the last vmemmap region at the end of hot add memory
mm/mmap.c: optimization of do_mmap_pgoff function
mm: kmemleak: optimise kmemleak_lock acquiring during kmemleak_scan
mm: kmemleak: avoid deadlock on the kmemleak object insertion error path
mm: kmemleak: do not acquire scan_mutex in kmemleak_do_cleanup()
mm: kmemleak: fix delete_object_*() race when called on the same memory block
mm: kmemleak: allow safe memory scanning during kmemleak disabling
memcg: convert mem_cgroup->under_oom from atomic_t to int
memcg: remove unused mem_cgroup->oom_wakeups
frontswap: allow multiple backends
x86, mirror: x86 enabling - find mirrored memory ranges
mm/memblock: allocate boot time data structures from mirrored memory
mm/memblock: add extra "flags" to memblock to allow selection of memory based on attribute
mm: do not ignore mapping_gfp_mask in page cache allocation paths
mm/cma.c: fix typos in comments
mm/oom_kill.c: print points as unsigned int
mm/hugetlb: handle races in alloc_huge_page and hugetlb_reserve_pages
...
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1
iQIcBAABAgAGBQJViyiCAAoJEKurIx+X31iB3acP/1kPYKClGZLoqH6wqHNM/djq
ROsPm9PDt9g7WZ/yw5HpKuUzC+XhOg4odYpZIy+6PogW6BUCumxPCLVq/qbVSPhT
Q7Pv0mlmjyJS+kj6FncWuJrJG0xQfKYE6OeYrnkyHfrHYmJunf1bS6K71CDGHJAa
O2bQr4E67twuU8yQR8BZ+YlZu4NTzPYZ4JWmb9Wepm3seIM2GEJsNRZ7WJXH7BOv
BHOPi8FDr8fMkA+2WitE853gYvcTYcuxlsDgRumtGzWDhRIUH8Q5yS9QLAFd0Rly
BW7YOHYCY1L75RJxnVTWd04GNrepxe4LY1bbtx+mqI6FrdMw0dK0M5BKuohV+BT0
tBC/anSHBOqua/aDA6m+8c+p8I7qp1wXNHtmm15lqKQg1YHvh0Rs7FlP3HBdLDxQ
rmUQHTcVQGaf00GCTUgsEn80kW8FYYtOnh4FJcbSkLdU8/mkr1+1rE/3i8ob/AL9
EsJgzqptT9/VBX3j6tZgk8tt6xstLMGVw/DmScjxeLqA2WoaINh5XRPgGCGWQW6T
wa9nFGBuJFtJv9NfVlMEe8xekxyDa6xPIgoCheWBcV9NFRmfBrUmbG4yF127nqTG
TXYBzFPSxdBn0FqGIaz+6RPbcGd7tZuz317sIYHTgHBWQHeoWG9TJGHeGt8L5uql
eKMoHAvVRZIyBuZruUeB
=248K
-----END PGP SIGNATURE-----
Merge tag 'please-pull-pstore' of git://git.kernel.org/pub/scm/linux/kernel/git/aegl/linux
Pull pstore updates from Tony Luck:
"Miscellaneous pstore improvements"
* tag 'please-pull-pstore' of git://git.kernel.org/pub/scm/linux/kernel/git/aegl/linux:
ramoops: make it possible to change mem_type param.
pstore/ram: verify ramoops header before saving record
fs/pstore: Optimization function ramoops_init_przs
fs/pstore: update the backend parameter in pstore module
pstore: do not use message compression without lock
Pull f2fs updates from Jaegeuk Kim:
"New features:
- per-file encryption (e.g., ext4)
- FALLOC_FL_ZERO_RANGE
- FALLOC_FL_COLLAPSE_RANGE
- RENAME_WHITEOUT
Major enhancement/fixes:
- recovery broken superblocks
- enhance f2fs_trim_fs with a discard_map
- fix a race condition on dentry block allocation
- fix a deadlock during summary operation
- fix a missing fiemap result
.. and many minor bug fixes and clean-ups were done"
* tag 'for-f2fs-4.2' of git://git.kernel.org/pub/scm/linux/kernel/git/jaegeuk/f2fs: (83 commits)
f2fs: do not trim preallocated blocks when truncating after i_size
f2fs crypto: add alloc_bounce_page
f2fs crypto: fix to handle errors likewise ext4
f2fs: drop the volatile_write flag only
f2fs: skip committing valid superblock
f2fs: setting discard option in parse_options()
f2fs: fix to return exact trimmed size
f2fs: support FALLOC_FL_INSERT_RANGE
f2fs: hide common code in f2fs_replace_block
f2fs: disable the discard option when device doesn't support
f2fs crypto: remove alloc_page for bounce_page
f2fs: fix a deadlock for summary page lock vs. sentry_lock
f2fs crypto: clean up error handling in f2fs_fname_setup_filename
f2fs crypto: avoid f2fs_inherit_context for symlink
f2fs crypto: do not set encryption policy for non-directory by ioctl
f2fs crypto: allow setting encryption policy once
f2fs crypto: check context consistent for rename2
f2fs: avoid duplicated code by reusing f2fs_read_end_io
f2fs crypto: use per-inode tfm structure
f2fs: recovering broken superblock during mount
...
Pull UDF fixes and cleanups from Jan Kara:
"The contains some small fixes and improvements in error handling for
UDF.
Bundled is also one ext3 coding style fix and a fix in quota
documentation"
* 'for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jack/linux-fs:
udf: fix udf_load_pvoldesc()
udf: remove double err declaration in udf_file_write_iter()
UDF: support NFSv2 export
fs: ext3: super: fixed a space coding style issue
quota: Update documentation
udf: Return error from udf_find_entry()
udf: Make udf_get_filename() return error instead of 0 length file name
udf: bug on exotic flag in udf_get_filename()
udf: improve error management in udf_CS0toNLS()
udf: improve error management in udf_CS0toUTF8()
udf: unicode: update function name in comments
udf: remove unnecessary test in udf_build_ustr_exact()
udf: Return -ENOMEM when allocation fails in udf_get_filename()
page_cache_read, do_generic_file_read, __generic_file_splice_read and
__ntfs_grab_cache_pages currently ignore mapping_gfp_mask when calling
add_to_page_cache_lru which might cause recursion into fs down in the
direct reclaim path if the mapping really relies on GFP_NOFS semantic.
This doesn't seem to be the case now because page_cache_read (page fault
path) doesn't seem to suffer from the reclaim recursion issues and
do_generic_file_read and __generic_file_splice_read also shouldn't be
called under fs locks which would deadlock in the reclaim path. Anyway it
is better to obey mapping gfp mask and prevent from later breakage.
[akpm@linux-foundation.org: coding-style fixes]
Signed-off-by: Michal Hocko <mhocko@suse.cz>
Cc: Dave Chinner <david@fromorbit.com>
Cc: Neil Brown <neilb@suse.de>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Rik van Riel <riel@redhat.com>
Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
Cc: Anton Altaparmakov <anton@tuxera.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Currently we have many duplicates in definitions of
hugetlb_prefault_arch_hook. In all architectures this function is empty.
Signed-off-by: Zhang Zhen <zhenzhang.zhang@huawei.com>
Acked-by: David Rientjes <rientjes@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Allowing watchdog threads to be parked means that we now have the
opportunity of actually seeing persistent parked threads in the output
of /proc/<pid>/stat and /proc/<pid>/status. The existing code reported
such threads as "Running", which is kind-of true if you think of the
case where we park them as part of taking cpus offline. But if we allow
parking them indefinitely, "Running" is pretty misleading, so we report
them as "Sleeping" instead.
We could simply report them with a new string, "Parked", but it feels
like it's a bit risky for userspace to see unexpected new values; the
output is already documented in Documentation/filesystems/proc.txt, and
it seems like a mistake to change that lightly.
The scheduler does report parked tasks with a "P" in debugging output
from sched_show_task() or dump_cpu_task(), but that's a different API.
Similarly, the trace_ctxwake_* routines report a "P" for parked tasks,
but again, different API.
This change seemed slightly cleaner than updating the task_state_array
to have additional rows. TASK_DEAD should be subsumed by the exit_state
bits; TASK_WAKEKILL is just a modifier; and TASK_WAKING can very
reasonably be reported as "Running" (as it is now). Only TASK_PARKED
shows up with unreasonable output here.
Signed-off-by: Chris Metcalf <cmetcalf@ezchip.com>
Cc: Don Zickus <dzickus@redhat.com>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Ulrich Obergfell <uobergfe@redhat.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Frederic Weisbecker <fweisbec@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Some functions are only used locally, so mark them as static.
Signed-off-by: Joseph Qi <joseph.qi@huawei.com>
Cc: Mark Fasheh <mfasheh@suse.com>
Cc: Joel Becker <jlbec@evilplan.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Use kernel.h macro definition.
Thanks to Julia Lawall for Coccinelle scripting support.
Signed-off-by: Fabian Frederick <fabf@skynet.be>
Cc: Julia Lawall <julia.lawall@lip6.fr>
Cc: Mark Fasheh <mfasheh@suse.com>
Cc: Joel Becker <jlbec@evilplan.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Use kernel.h macro definition.
Thanks to Julia Lawall for Coccinelle scripting support.
Signed-off-by: Fabian Frederick <fabf@skynet.be>
Cc: Julia Lawall <julia.lawall@lip6.fr>
Cc: Mark Fasheh <mfasheh@suse.com>
Cc: Joel Becker <jlbec@evilplan.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Use kernel.h macro definition.
Thanks to Julia Lawall for Coccinelle scripting support.
Signed-off-by: Fabian Frederick <fabf@skynet.be>
Cc: Julia Lawall <julia.lawall@lip6.fr>
Cc: Mark Fasheh <mfasheh@suse.com>
Cc: Joel Becker <jlbec@evilplan.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
contig_blocks gotten from ocfs2_extent_map_get_blocks cannot be compared
with clusters_to_alloc. So convert it to clusters first.
Signed-off-by: Joseph Qi <joseph.qi@huawei.com>
Reviewed-by: Weiwei Wang <wangww631@huawei.com>
Cc: Mark Fasheh <mfasheh@suse.com>
Cc: Joel Becker <jlbec@evilplan.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
ocfs2_abort_trigger() use bh->b_assoc_map to get sb. But there's no
function to set bh->b_assoc_map in ocfs2, it will trigger NULL pointer
dereference while calling this function. We can get sb from
bh->b_bdev->bd_super instead of b_assoc_map.
[akpm@linux-foundation.org: update comment, per Joseph]
Signed-off-by: joyce.xue <xuejiufei@huawei.com>
Cc: Joseph Qi <joseph.qi@huawei.com>
Cc: Mark Fasheh <mfasheh@suse.com>
Cc: Joel Becker <jlbec@evilplan.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Alex Chen <alex.chen@huawei.com>
Reviewed-by: Joseph Qi <joseph.qi@huawei.com>
Cc: Mark Fasheh <mfasheh@suse.com>
Cc: Joel Becker <jlbec@evilplan.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
In ocfs2 direct read/write, OCFS2_IOCB_SEM lock type is used to protect
inode->i_alloc_sem rw semaphore lock in the earlier kernel version.
However, in the latest kernel, inode->i_alloc_sem rw semaphore lock is not
used at all, so OCFS2_IOCB_SEM lock type needs to be removed.
Signed-off-by: Weiwei Wang <wangww631@huawei.com>
Cc: Mark Fasheh <mfasheh@suse.com>
Cc: Joel Becker <jlbec@evilplan.org>
Reviewed-by: Junxiao Bi <junxiao.bi@oracle.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
jbd2_journal_dirty_metadata may fail. Currently it cannot take care of
non zero return value and just BUG in ocfs2_journal_dirty. This patch is
aborting the handle and journal instead of BUG.
Signed-off-by: Joseph Qi <joseph.qi@huawei.com>
Cc: joyce.xue <xuejiufei@huawei.com>
Cc: Mark Fasheh <mfasheh@suse.com>
Cc: Joel Becker <jlbec@evilplan.org>
Cc: Junxiao Bi <junxiao.bi@oracle.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
ocfs2_rotate_tree_left() calls __ocfs2_rotate_tree_left() for left
rotation while non-rightmost path containing an empty extent in the leaf
block. __ocfs2_rotate_tree_left() returns -EAGAIN if right subtree having an
empty extent and pass the empty_extent_path to caller. The caller
ocfs2_rotate_tree_left() will restart rotation from the returned path.
It will trigger the BUG_ON(!ocfs2_is_empty_extent) when the et on disk
is as follows:
eb0 is the leaf block of path(say path_a) passed to
ocfs2_rotate_tree_left, which has an empty rec[0].
eb1 is the leaf block of path(say path_b) that just right to path_a, which
has no empty record.
eb2 is the leaf block of path(say path_c) that just right to path_b, which
has an empty rec[0]. And path_c is also the rightmost path.
Now we want to remove the empty rec[0] in eb0:
ocfs2_rotate_tree_left:
-> call __ocfs2_rotate_tree_left with path_a as its input *path*
-> call ocfs2_rotate_subtree_left with path_a as its input
*left_path* and path_b as its input *right_path*. it will move
rec[0] in eb1 to eb0, and rec[0] in eb0 is not empty now.
-> continue to call ocfs2_rotate_subtree_left with path_b as its
input *left_path* and path_c as its input *right_path*, and
return -EAGAIN because eb2 has an empty rec[0]
-> call __ocfs2_rotate_tree_left with path_c as it input, rotate all
records in eb2 to left and return 0.
-> call __ocfs2_rotate_tree_left with path_a as its input, and triggers
the BUG_ON(!ocfs2_is_empty_extent) as the rec[0] in eb0 is not empty.
So the BUG_ON() should be removed and return 0 if rec[0] is no longer an
empty extent.
Signed-off-by: joyce.xue <xuejiufei@huawei.com>
Cc: Mark Fasheh <mfasheh@suse.com>
Cc: Joel Becker <jlbec@evilplan.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
ocfs2_figure_merge_contig_type() still returns CONTIG_NONE when some error
occurs which will cause an unpredictable error. So return a proper errno
when ocfs2_figure_merge_contig_type() fails.
Signed-off-by: joyce.xue <xuejiufei@huawei.com>
Cc: Mark Fasheh <mfasheh@suse.com>
Cc: Joel Becker <jlbec@evilplan.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
__dlm_wait_on_lockres_flags_set() is declared but not implemented and
used. So remove it.
Signed-off-by: Joseph Qi <joseph.qi@huawei.com>
Cc: Mark Fasheh <mfasheh@suse.com>
Cc: Joel Becker <jlbec@evilplan.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
The use of 'status' in __ocfs2_add_entry() can return wrong value.
Some functions' return value in __ocfs2_add_entry(), i.e
ocfs2_journal_access_di() is saved to 'status'. But 'status' is not
used in 'bail' label for returning result of __ocfs2_add_entry().
So use retval instead of status.
Signed-off-by: Daeseok Youn <daeseok.youn@gmail.com>
Reviewed-by: Joseph Qi <joseph.qi@huawei.com>
Cc: Joel Becker <jlbec@evilplan.org>
Cc: Mark Fasheh <mfasheh@suse.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Once dio crashed it will leave an entry in orphan dir. And orphan scan
will take care of the clean up. There is a tiny race case that the same
entry will be truncated twice and then trigger the BUG in
ocfs2_del_inode_from_orphan.
Signed-off-by: Joseph Qi <joseph.qi@huawei.com>
Cc: Mark Fasheh <mfasheh@suse.com>
Cc: Joel Becker <jlbec@evilplan.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>