These are very small changes that make use osd_data local pointers
as shorthands for structures being operated on.
Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
Define and use functions that encapsulate the initializion of a
ceph_osd_data structure.
Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
Hold off building the osd request message in ceph_writepages_start()
until just before it will be submitted to the osd client for
execution.
We'll still create the request and allocate the page pointer array
after we learn we have at least one page to write. A local variable
will be used to keep track of the allocated array of pages. Wait
until just before submitting the request for assigning that page
array pointer to the request message.
Create ands use a new function osd_req_op_extent_update() whose
purpose is to serve this one spot where the length value supplied
when an osd request's op was initially formatted might need to get
changed (reduced, never increased) before submitting the request.
Previously, ceph_writepages_start() assigned the message header's
data length because of this update. That's no longer necessary,
because ceph_osdc_build_request() will recalculate the right
value to use based on the content of the ops in the request.
Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
Defer building the osd request until just before submitting it in
all callers except ceph_writepages_start(). (That caller will be
handed in the next patch.)
Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
There is a helper function alloc_page_vec() that, despite its
generic sounding name depends heavily on an osd request structure
being populated with certain information.
There is only one place this function is used, and it ends up
being a bit simpler to just open code what it does, so get
rid of the helper.
The real motivation for this is deferring building the of the osd
request message, and this is a step in that direction.
Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
Mostly for readability, define ceph_writepages_osd_request() and
use it to allocate the osd request for ceph_writepages_start().
Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
This patch moves the call to ceph_osdc_build_request() out of
ceph_osdc_new_request() and into its caller.
This is in order to defer formatting osd operation information into
the request message until just before request is started.
The only unusual (ab)user of ceph_osdc_build_request() is
ceph_writepages_start(), where the final length of write request may
change (downward) based on the current inode size or the oldest
snapshot context with dirty data for the inode.
The remaining callers don't change anything in the request after has
been built.
This means the ops array is now supplied by the caller. It also
means there is no need to pass the mtime to ceph_osdc_new_request()
(it gets provided to ceph_osdc_build_request()). And rather than
passing a do_sync flag, have the number of ops in the ops array
supplied imply adding a second STARTSYNC operation after the READ or
WRITE requested.
This and some of the patches that follow are related to having the
messenger (only) be responsible for filling the content of the
message header, as described here:
http://tracker.ceph.com/issues/4589
Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
There's one spot in ceph_writepages_start() that open-codes what
page_offset() does safely. Use the macro so we don't have to worry
about wrapping.
This resolves:
http://tracker.ceph.com/issues/4648
Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
In create_fs_client() a memory pool is set up be used for arrays of
pages that might be needed in ceph_writepages_start() if memory is
tight. There are two problems with the way it's initialized:
- The size provided is the number of pages we want in the
array, but it should be the number of bytes required for
that many page pointers.
- The number of pages computed can end up being 0, while we
will always need at least one page.
This patch fixes both of these problems.
This resolves the two simple problems defined in:
http://tracker.ceph.com/issues/4603
Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
Use wrapper functions that check whether the auth op exists so that callers
do not need a bunch of conditional checks. Simplifies the external
interface.
Signed-off-by: Sage Weil <sage@inktank.com>
Reviewed-by: Alex Elder <elder@inktank.com>
Currently the messenger calls out to a get_authorizer con op, which will
create a new authorizer if it doesn't yet have one. In the meantime, when
we rotate our service keys, the authorizer doesn't get updated. Eventually
it will be rejected by the server on a new connection attempt and get
invalidated, and we will then rebuild a new authorizer, but this is not
ideal.
Instead, if we do have an authorizer, call a new update_authorizer op that
will verify that the current authorizer is using the latest secret. If it
is not, we will build a new one that does. This avoids the transient
failure.
This fixes one of the sorry sequence of events for bug
http://tracker.ceph.com/issues/4282
Signed-off-by: Sage Weil <sage@inktank.com>
Reviewed-by: Alex Elder <elder@inktank.com>
We should advance the user data pointer by _len_ instead of _written_.
_len_ is the data length written in each iteration while _written_ is the
accumulated data length we have writtent out.
Signed-off-by: Henry C Chang <henry.cy.chang@gmail.com>
Reviewed-by: Greg Farnum <greg@inktank.com>
Tested-by: Sage Weil <sage@inktank.com>
Current ceph code tracks directory's completeness in two places.
ceph_readdir() checks i_release_count to decide if it can set the
I_COMPLETE flag in i_ceph_flags. All other places check the I_COMPLETE
flag. This indirection introduces locking complexity.
This patch adds a new variable i_complete_count to ceph_inode_info.
Set i_release_count's value to it when marking a directory complete.
By comparing the two variables, we know if a directory is complete
Signed-off-by: Yan, Zheng <zheng.z.yan@intel.com>
Change it so we only assign outgoing data information for messages
if there is outgoing data to send.
This then allows us to add a few more (currently commented-out)
assertions.
This is related to:
http://tracker.ceph.com/issues/4284
Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Greg Farnum <greg@inktank.com>
Define ceph_msg_data_set_pagelist(), ceph_msg_data_set_bio(), and
ceph_msg_data_set_trail() to clearly abstract the assignment of the
remaining data-related fields in a ceph message structure. Use the
new functions in the osd client and mds client.
This partially resolves:
http://tracker.ceph.com/issues/4263
Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
When setting page array information for message data, provide the
byte length rather than the page count ceph_msg_data_set_pages().
Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
Define a function ceph_msg_data_set_pages(), which more clearly
abstracts the assignment page-related fields for data in a ceph
message structure. Use this new function in the osd client and mds
client.
Ideally, these fields would never be set more than once (with
BUG_ON() calls to guarantee that). At the moment though the osd
client sets these every time it receives a message, and in the event
of a communication problem this can happen more than once. (This
will be resolved shortly, but setting up these helpers first makes
it all a bit easier to work with.)
Rearrange the field order in a ceph_msg structure to group those
that are used to define the possible data payloads.
This partially resolves:
http://tracker.ceph.com/issues/4263
Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
Record the byte count for an osd request rather than the page count.
The number of pages can always be derived from the byte count (and
alignment/offset) but the reverse is not true.
Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
An osd request defines information about where data to be read
should be placed as well as where data to write comes from.
Currently these are represented by common fields.
Keep information about data for writing separate from data to be
read by splitting these into data_in and data_out fields.
This is the key patch in this whole series, in that it actually
identifies which osd requests generate outgoing data and which
generate incoming data. It's less obvious (currently) that an osd
CALL op generates both outgoing and incoming data; that's the focus
of some upcoming work.
This resolves:
http://tracker.ceph.com/issues/4127
Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
An osd request uses either pages or a bio list for its data. Use a
union to record information about the two, and add a data type
tag to select between them.
Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
Pull the fields in an osd request structure that define the data for
the request out into a separate structure.
Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
Currently ceph_osdc_new_request() assigns an osd request's
r_num_pages and r_alignment fields. The only thing it does
after that is call ceph_osdc_build_request(), and that doesn't
need those fields to be assigned.
Move the assignment of those fields out of ceph_osdc_new_request()
and into its caller. As a result, the page_align parameter is no
longer used, so get rid of it.
Note that in ceph_sync_write(), the value for req->r_num_pages had
already been calculated earlier (as num_pages, and fortunately
it was computed the same way). So don't bother recomputing it,
but because it's not needed earlier, move that calculation after the
call to ceph_osdc_new_request(). Hold off making the assignment to
r_alignment, doing it instead r_pages and r_num_pages are
getting set.
Similarly, in start_read(), nr_pages already holds the number of
pages in the array (and is calculated the same way), so there's no
need to recompute it. Move the assignment of the page alignment
down with the others there as well.
This and the next few patches are preparation work for:
http://tracker.ceph.com/issues/4127
Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
(This is being reposted. The first one had a problem because it
erroneously added a similar change elsewhere; that change has been
dropped.)
The next patch in this series points out that the calculation for
the number of pages in an osd request is getting done twice. It
is not obvious, but the result of both calculations is identical.
This patch simplifies one of them--as a separate step--to make
it clear that the transformation in the next patch is valid.
In ceph_sync_write() there is some magic that computes page_align
for an osd request. But a little analysis shows it can be
simplified.
First, we have:
io_align = pos & ~PAGE_MASK;
which is used here:
page_align = (pos - io_align + buf_align) & ~PAGE_MASK;
Note (pos - io_align) simply rounds "pos" down to the nearest multiple
of the page size.
We also have:
buf_align = (unsigned long)data & ~PAGE_MASK;
Adding buf_align to that rounded-down "pos" value will stay within
the same page; the result will just be offset by the page offset for
the "data" pointer. The final mask therefore leaves just the value
of "buf_align".
One more simplification. Note that the result of calc_pages_for()
is invariant of which page the offset starts in--the only thing that
matters is the offset within the starting page. We will have
put the proper page offset to use into "page_align", so just use
that in calculating num_pages.
This resolves:
http://tracker.ceph.com/issues/4166
Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
There's a spot that computes the number of pages to allocate for a
page-aligned length by just shifting it. Use calc_pages_for()
instead, to be consistent with usage everywhere else. The result
is the same.
The reason for this is to make it clearer in an upcoming patch that
this calculation is duplicated.
Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
Currently, incoming mds messages never use page data, which means
there is no need to set the page_alignment field in the message.
Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Greg Farnum <greg@inktank.com>
The only user of the ceph messenger that doesn't define an alloc_msg
method is the mds client. Define one, such that it works just like
it did before, and simplify ceph_con_in_msg_alloc() by assuming the
alloc_msg method is always present.
This and the next patch resolve:
http://tracker.ceph.com/issues/4322
Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Greg Farnum <greg@inktank.com>
The purpose of ceph_calc_object_layout() is to fill in the pool
number and seed for a ceph_pg structure provided, based on a given
osd map and target object id.
Currently that function takes a file layout parameter, but the only
thing used out of that is its pool number.
Change the function so it takes a pool number rather than the full
file layout structure. Only update the ceph_pg if the pool is found
in the osd map. Get rid of few useless lines of code from the
function while there.
Since the function now very clearly just fills in the ceph_pg
structure it's provided, rename it ceph_calc_ceph_pg().
Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
The pagelist_count field is never actually used, so get rid of it.
Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
make __ceph_do_pending_vmtruncate() acquire the i_mutex if the caller
does not hold the i_mutex, so ceph_aio_read() can call safely.
Signed-off-by: Yan, Zheng <zheng.z.yan@intel.com>
Reviewed-by: Greg Farnum <greg@inktank.com>
ceph_aio_write() has an optimization that marks CEPH_CAP_FILE_WR
cap dirty before data is copied to page cache and inode size is
updated. The optimization avoids slow cap revocation caused by
balance_dirty_pages(), but introduces inode size update race. If
ceph_check_caps() flushes the dirty cap before the inode size is
updated, MDS can miss the new inode size. So just remove the
optimization.
Signed-off-by: Yan, Zheng <zheng.z.yan@intel.com>
Reviewed-by: Greg Farnum <greg@inktank.com>
commit 22cddde104 breaks the atomicity of write operation, it also
introduces a deadlock between write and truncate.
Signed-off-by: Yan, Zheng <zheng.z.yan@intel.com>
Reviewed-by: Greg Farnum <greg@inktank.com>
Conflicts:
fs/ceph/addr.c
commit c6ffe10015 moved the flag that tracks if the dcache contents
for a directory are complete to dentry. The problem is there are
lots of places that use ceph_dir_{set,clear,test}_complete() while
holding i_ceph_lock. but ceph_dir_{set,clear,test}_complete() may
sleep because they call dput().
This patch basically reverts that commit. For ceph_d_prune(), it's
called with both the dentry to prune and the parent dentry are
locked. So it's safe to access the parent dentry's d_inode and
clear I_COMPLETE flag.
Signed-off-by: Yan, Zheng <zheng.z.yan@intel.com>
Reviewed-by: Greg Farnum <greg@inktank.com>
Reviewed-by: Sage Weil <sage@inktank.com>
MDS ignores cap update message if migrate_seq mismatch, so when
receiving a cap import message with higher migrate_seq, set mds_want
according to the cap import message.
Signed-off-by: Yan, Zheng <zheng.z.yan@intel.com>
Reviewed-by: Greg Farnum <greg@inktank.com>
Use distinct fields for tracking the number of pages in a message's
page array and in a message's page list. Currently only one or the
other is used at a time, but that will be changing soon.
Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
Pull VFS updates from Al Viro,
Misc cleanups all over the place, mainly wrt /proc interfaces (switch
create_proc_entry to proc_create(), get rid of the deprecated
create_proc_read_entry() in favor of using proc_create_data() and
seq_file etc).
7kloc removed.
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: (204 commits)
don't bother with deferred freeing of fdtables
proc: Move non-public stuff from linux/proc_fs.h to fs/proc/internal.h
proc: Make the PROC_I() and PDE() macros internal to procfs
proc: Supply a function to remove a proc entry by PDE
take cgroup_open() and cpuset_open() to fs/proc/base.c
ppc: Clean up scanlog
ppc: Clean up rtas_flash driver somewhat
hostap: proc: Use remove_proc_subtree()
drm: proc: Use remove_proc_subtree()
drm: proc: Use minor->index to label things, not PDE->name
drm: Constify drm_proc_list[]
zoran: Don't print proc_dir_entry data in debug
reiserfs: Don't access the proc_dir_entry in r_open(), r_start() r_show()
proc: Supply an accessor for getting the data from a PDE's parent
airo: Use remove_proc_subtree()
rtl8192u: Don't need to save device proc dir PDE
rtl8187se: Use a dir under /proc/net/r8180/
proc: Add proc_mkdir_data()
proc: Move some bits from linux/proc_fs.h to linux/{of.h,signal.h,tty.h}
proc: Move PDE_NET() to fs/proc/proc_net.c
...
Pull x86/efi changes from Peter Anvin:
"The bulk of these changes are cleaning up the efivars handling and
breaking it up into a tree of files. There are a number of fixes as
well.
The entire changeset is pretty big, but most of it is code movement.
Several of these commits are quite new; the history got very messed up
due to a mismerge with the urgent changes for rc8 which completely
broke IA64, and so Ingo requested that we rebase it to straighten it
out."
* 'x86-efi-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
efi: remove "kfree(NULL)"
efi: locking fix in efivar_entry_set_safe()
efi, pstore: Read data from variable store before memcpy()
efi, pstore: Remove entry from list when erasing
efi, pstore: Initialise 'entry' before iterating
efi: split efisubsystem from efivars
efivarfs: Move to fs/efivarfs
efivars: Move pstore code into the new EFI directory
efivars: efivar_entry API
efivars: Keep a private global pointer to efivars
efi: move utf16 string functions to efi.h
x86, efi: Make efi_memblock_x86_reserve_range more readable
efivarfs: convert to use simple_open()
Move non-public declarations and definitions from linux/proc_fs.h to
fs/proc/internal.h.
Signed-off-by: David Howells <dhowells@redhat.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Make the PROC_I() and PDE() macros internal to procfs. This means making
PDE_DATA() out of line. This could be made more optimal by storing
PDE()->data into inode->i_private.
Also provide a __PDE_DATA() that is inline and internal to procfs.
Signed-off-by: David Howells <dhowells@redhat.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Don't access the proc_dir_entry in ReiserFS's r_open(), r_start() r_show()
procfs interface functions.
ReiserFS stores the ->show() method pointer in PDE->data and the super_block
pointer in PDE->parent->data. This isn't changing.
Currently, ReiserFS passes the PDE pointer into seq_file::private from
r_open() so that r_start() and r_show() can then access it. Instead, use
seq_open_private() to allocate a two-pointer struct that's passed through
seq_file::private and put the ->show() method and the sb pointers in there.
Signed-off-by: David Howells <dhowells@redhat.com>
cc: reiserfs-devel@vger.kernel.org
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Supply an accessor function for getting the private data from the parent
proc_dir_entry struct of the proc_dir_entry struct associated with an inode.
ReiserFS, for instance, stores the super_block pointer in the proc directory
it makes for that super_block, and a pointer to the respective seq_file show
function in each of the proc files in that directory.
This allows a reduction in the number of file_operations structs, open
functions and seq_operations structs required. The problem otherwise is that
each show function requires two pieces of data but only has storage for one
per PDE (and this has no release function).
Signed-off-by: David Howells <dhowells@redhat.com>
Acked-by: Mauro Carvalho Chehab <mchehab@redhat.com>
Acked-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
cc: Jerry Chuang <jerry-chuang@realtek.com>
cc: Maxim Mikityanskiy <maxtram95@gmail.com>
cc: YAMANE Toshiaki <yamanetoshi@gmail.com>
cc: linux-wireless@vger.kernel.org
cc: linux-scsi@vger.kernel.org
cc: devel@driverdev.osuosl.org
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Add proc_mkdir_data() to allow procfs directories to be created that are
annotated at the time of creation with private data rather than doing this
post-creation. This means no access is then required to the proc_dir_entry
struct to set this.
Signed-off-by: David Howells <dhowells@redhat.com>
Acked-by: Mauro Carvalho Chehab <mchehab@redhat.com>
Acked-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
cc: Neela Syam Kolli <megaraidlinux@lsi.com>
cc: Jerry Chuang <jerry-chuang@realtek.com>
cc: linux-scsi@vger.kernel.org
cc: devel@driverdev.osuosl.org
cc: linux-wireless@vger.kernel.org
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Move some bits from linux/proc_fs.h to linux/of.h, signal.h and tty.h.
Also move proc_tty_init() and proc_device_tree_init() to fs/proc/internal.h as
they're internal to procfs.
Signed-off-by: David Howells <dhowells@redhat.com>
Acked-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Acked-by: Grant Likely <grant.likely@secretlab.ca>
cc: devicetree-discuss@lists.ozlabs.org
cc: linux-arch@vger.kernel.org
cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
cc: Jri Slaby <jslaby@suse.cz>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Move PDE_NET() to fs/proc/proc_net.c as that's where the only user is.
Signed-off-by: David Howells <dhowells@redhat.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Split the proc namespace stuff out into linux/proc_ns.h.
Signed-off-by: David Howells <dhowells@redhat.com>
cc: netdev@vger.kernel.org
cc: Serge E. Hallyn <serge.hallyn@ubuntu.com>
cc: Eric W. Biederman <ebiederm@xmission.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>