24777 commits

Author SHA1 Message Date
Linus Torvalds
8a9ea3237e Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-next
* git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-next: (1745 commits)
  dp83640: free packet queues on remove
  dp83640: use proper function to free transmit time stamping packets
  ipv6: Do not use routes from locally generated RAs
  [PATCH net-next] tg3: add tx_dropped counter
  be2net: don't create multiple RX/TX rings in multi channel mode
  be2net: don't create multiple TXQs in BE2
  be2net: refactor VF setup/teardown code into be_vf_setup/clear()
  be2net: add vlan/rx-mode/flow-control config to be_setup()
  net_sched: cls_flow: use skb_header_pointer()
  ipv4: avoid useless call of the function check_peer_pmtu
  TCP: remove TCP_DEBUG
  net: Fix driver name for mdio-gpio.c
  ipv4: tcp: fix TOS value in ACK messages sent from TIME_WAIT
  rtnetlink: Add missing manual netlink notification in dev_change_net_namespaces
  ipv4: fix ipsec forward performance regression
  jme: fix irq storm after suspend/resume
  route: fix ICMP redirect validation
  net: hold sock reference while processing tx timestamps
  tcp: md5: add more const attributes
  Add ethtool -g support to virtio_net
  ...

Fix up conflicts in:
 - drivers/net/Kconfig:
	The split-up generated a trivial conflict with removal of a
	stale reference to Documentation/networking/net-modules.txt.
	Remove it from the new location instead.
 - fs/sysfs/dir.c:
	Fairly nasty conflicts with the sysfs rb-tree usage, conflicting
	with Eric Biederman's changes for tagged directories.
2011-10-25 13:25:22 +02:00
Stanislav Kinsbursky
16d0587090 NFSd: call svc rpcbind cleanup explicitly
We have to call svc_rpcb_cleanup() explicitly from nfsd_last_thread() since
this function is registered as the service shutdown callback and thus nobody
else will do it for us.

Signed-off-by: Stanislav Kinsbursky <skinsbursky@parallels.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
2011-10-25 13:19:40 +02:00
Linus Torvalds
2d03423b23 Merge branch 'driver-core-next' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/driver-core
* 'driver-core-next' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/driver-core: (38 commits)
  mm: memory hotplug: Check if pages are correctly reserved on a per-section basis
  Revert "memory hotplug: Correct page reservation checking"
  Update email address for stable patch submission
  dynamic_debug: fix undefined reference to `__netdev_printk'
  dynamic_debug: use a single printk() to emit messages
  dynamic_debug: remove num_enabled accounting
  dynamic_debug: consolidate repetitive struct _ddebug descriptor definitions
  uio: Support physical addresses >32 bits on 32-bit systems
  sysfs: add unsigned long cast to prevent compile warning
  drivers: base: print rejected matches with DEBUG_DRIVER
  memory hotplug: Correct page reservation checking
  memory hotplug: Refuse to add unaligned memory regions
  remove the messy code file Documentation/zh_CN/SubmitChecklist
  ARM: mxc: convert device creation to use platform_device_register_full
  new helper to create platform devices with dma mask
  docs/driver-model: Update device class docs
  docs/driver-model: Document device.groups
  kobj_uevent: Ignore if some listeners cannot handle message
  dynamic_debug: make netif_dbg() call __netdev_printk()
  dynamic_debug: make netdev_dbg() call __netdev_printk()
  ...
2011-10-25 12:13:59 +02:00
Linus Torvalds
59e5253417 Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jikos/trivial
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jikos/trivial: (59 commits)
  MAINTAINERS: linux-m32r is moderated for non-subscribers
  linux@lists.openrisc.net is moderated for non-subscribers
  Drop default from "DM365 codec select" choice
  parisc: Kconfig: cleanup Kernel page size default
  Kconfig: remove redundant CONFIG_ prefix on two symbols
  cris: remove arch/cris/arch-v32/lib/nand_init.S
  microblaze: add missing CONFIG_ prefixes
  h8300: drop puzzling Kconfig dependencies
  MAINTAINERS: microblaze-uclinux@itee.uq.edu.au is moderated for non-subscribers
  tty: drop superfluous dependency in Kconfig
  ARM: mxc: fix Kconfig typo 'i.MX51'
  Fix file references in Kconfig files
  aic7xxx: fix Kconfig references to READMEs
  Fix file references in drivers/ide/
  thinkpad_acpi: Fix printk typo 'bluestooth'
  bcmring: drop commented out line in Kconfig
  btmrvl_sdio: fix typo 'btmrvl_sdio_sd6888'
  doc: raw1394: Trivial typo fix
  CIFS: Don't free volume_info->UNC until we are entirely done with it.
  treewide: Correct spelling of successfully in comments
  ...
2011-10-25 12:11:02 +02:00
Dmitry Monakhov
750c9c47a5 ext4: remove messy logic from ext4_ext_rm_leaf
- Both callers (truncate and punch_hole) already align the left end point,
  so we no longer need the split logic here.
- Remove dead duplicated code.
- Call ext4_ext_dirty only after we have updated eh_entries; otherwise
  we'll lose the entries update. Regression caused by d583fb87a3;
  see test case 266 in xfstests (http://patchwork.ozlabs.org/patch/120872)

Signed-off-by: Dmitry Monakhov <dmonakhov@openvz.org>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2011-10-25 05:35:05 -04:00
Linus Torvalds
36b8d186e6 Merge branch 'next' of git://selinuxproject.org/~jmorris/linux-security
* 'next' of git://selinuxproject.org/~jmorris/linux-security: (95 commits)
  TOMOYO: Fix incomplete read after seek.
  Smack: allow to access /smack/access as normal user
  TOMOYO: Fix unused kernel config option.
  Smack: fix: invalid length set for the result of /smack/access
  Smack: compilation fix
  Smack: fix for /smack/access output, use string instead of byte
  Smack: domain transition protections (v3)
  Smack: Provide information for UDS getsockopt(SO_PEERCRED)
  Smack: Clean up comments
  Smack: Repair processing of fcntl
  Smack: Rule list lookup performance
  Smack: check permissions from user space (v2)
  TOMOYO: Fix quota and garbage collector.
  TOMOYO: Remove redundant tasklist_lock.
  TOMOYO: Fix domain transition failure warning.
  TOMOYO: Remove tomoyo_policy_memory_lock spinlock.
  TOMOYO: Simplify garbage collector.
  TOMOYO: Fix make namespacecheck warnings.
  target: check hex2bin result
  encrypted-keys: check hex2bin result
  ...
2011-10-25 09:45:31 +02:00
Boaz Harrosh
44231e686b ore: Enable RAID5 mounts
Now that we support RAID5, enable it at mount. RAID6 will come next;
RAID4 is not in demand, so it will probably not be enabled
(until someone wants it).

NOTE: mkfs.exofs has supported raid5/6 for a long time.
(Making an empty raidX FS is just as easy as raid0 ;-} )

Signed-off-by: Boaz Harrosh <bharrosh@panasas.com>
2011-10-24 17:22:29 -07:00
Boaz Harrosh
dd29661997 exofs: Support for RAID5 read-4-write interface.
The ORE needs the filesystem to supply an r4w_get_page/r4w_put_page API
so it can get cache pages to read into when
writing partial stripes.

Also, I commented out and NULLed the .writepage (singular)
vector, because it gives a terrible write pattern to RAID
and is apparently not needed. Even in OOM conditions the
system copes (even better) without it.

TODO: How to specify to write_cache_pages() to start
      or include a certain page?

Signed-off-by: Boaz Harrosh <bharrosh@panasas.com>
2011-10-24 17:22:28 -07:00
Boaz Harrosh
769ba8d920 ore: RAID5 Write
This is finally the RAID5 Write support.

The bigger part of this patch is not the XOR engine itself, but the
read4write logic, which is a complete mini prepare_for_striping
reading engine that can read scattered pages of a stripe into cache
so they can be used for the XOR calculation, in case the write was not
stripe aligned.

The main algorithm behind the XOR engine is the 2 dimensional array:
	struct __stripe_pages_2d.
A drawing might save 1000 words
---

__stripe_pages_2d
       |
 n = pages_in_stripe_unit;
 w = group_width - parity;
       |                            pages array presented to the XOR lib
       |                                                |
       V                                                |
 __1_page_stripe[0].pages --> [c0][c1]..[cw][c_par] <---|
       |                                                |
 __1_page_stripe[1].pages --> [c0][c1]..[cw][c_par] <---
       |
...    |                         ...
       |
 __1_page_stripe[n].pages --> [c0][c1]..[cw][c_par]
                               ^
                               |
           data added columns first then row

---
The pages are put on this array columns first, i.e.:
	p0-of-c0, p1-of-c0, ... pn-of-c0, p0-of-c1, ...
So we are doing a corner turn of the pages.

Note that pages will zigzag down and left, but are put sequentially
in growing order. So when the time comes to XOR the stripe, only the
beginning and end of the array need be checked. We scan the array
and any NULL spot will be filled by pages-to-be-read.

The FS that wants to support RAID5 needs to supply an
operations-vector that searches a given page in cache, and specifies
if the page is uptodate or needs reading. All these pages to be read
are put on a slave ore_io_state and synchronously read. All the pages
of a stripe are read in one IO, using the scatter gather mechanism.

In write we constrain our IO to only be incomplete on a single
stripe. This means either the complete IO is within a single stripe, so
we might have pages to read at the beginning, the end, or both ends of
the stripe, or we have some reading to do at the beginning but end at a
stripe boundary. The leftover pages are pushed to the next IO by the API
already established by previous work, where an IO offset/length
combination presented to the ORE might get the length truncated and
the user must re-submit the leftover pages. (Both exofs and NFS
support this.)

But any ORE user should make its best effort to align its IO
beforehand and avoid complications. A cached ore_layout->stripe_size
member can be used for that calculation. (NOTE: ORE demands
that stripe_size not be bigger than 32 bits.)

What else? Well read it and tell me.

Signed-off-by: Boaz Harrosh <bharrosh@panasas.com>
2011-10-24 17:15:33 -07:00
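
The column-first placement ("corner turn") described in "ore: RAID5 Write" above can be illustrated with a small Python model. This is only a sketch of the idea, not the ORE code: the names fill_columns_first, slots_to_read, n (pages per stripe unit), w (data columns) and first_slot are hypothetical, and None stands in for a NULL slot that would be filled by pages to be read.

```python
def fill_columns_first(pages, n, w, first_slot=0):
    """Place sequential pages into an n-row by w-column grid, column by column.

    Each row models one __1_page_stripe entry; None models a NULL slot.
    first_slot is the column-major position of the first page written.
    """
    grid = [[None] * w for _ in range(n)]
    for i, page in enumerate(pages):
        slot = first_slot + i
        col, row = divmod(slot, n)   # walk down a column, then move to the next
        grid[row][col] = page
    return grid


def slots_to_read(grid):
    """Return (row, col) of every empty slot, i.e. pages that must be read."""
    return [(r, c) for r, row in enumerate(grid)
            for c, page in enumerate(row) if page is None]


# A write that starts at the second page of column 0 and covers four pages.
grid = fill_columns_first(["p1", "p2", "p3", "p4"], n=3, w=2, first_slot=1)
missing = slots_to_read(grid)
```

In this model the empty slots sit only before the first and after the last written page in column-major order, which is why only the beginning and end of the array need checking.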
Boaz Harrosh
a1fec1dbbc ore: RAID5 read
This patch introduces the first stage of RAID5 support,
mainly the skip-over-raid-units when reading. For
writes it inserts BLANK units where XOR blocks
should be calculated and written to.

It introduces the new "general raid maths", and the main
additional parameters and components needed for raid5.

Since at this stage it could corrupt future versions that
actually do support raid5, the enablement of raid5
mounting and setting of parity-count > 0 is disabled, so
the raid5 code will never be used. Mounting of raid5 is
only enabled later, once the basic XOR write is also in.
But if the patch "enable RAID5" is applied, this code has
been tested to properly read raid5 volumes
and conforms to the standard.

It has also been tested that the new maths still properly
supports RAID0 and the grouping code just as before.
(BTW: I have found more bugs in the pnfs-obj RAID math,
 fixed here)

The ore.c file is getting too big, so new ore_raid.[hc]
files are added that will include the special raid stuff
that is not used in striping and mirrors. With future write
support these will get bigger.
When adding ore_raid.c to the Kbuild file I was forced to
rename ore.ko to libore.ko. Is it possible to keep the source
file, say ore.c, and the module file ore.ko named alike even if
there are multiple files inside ore.ko?

Signed-off-by: Boaz Harrosh <bharrosh@panasas.com>
2011-10-24 16:55:36 -07:00
Boaz Harrosh
3e335672e0 fs/Makefile: Always inspect exofs/
The fs/exofs directory has multiple targets now, of which
ore.ko will be needed by the pnfs-objects-layout-driver
(fs/nfs/objlayout).

As suggested by Michal Marek <mmarek@suse.cz>, convert the
inclusion of exofs/ from obj-$(CONFIG_EXOFS_FS) => obj-$(y),
so that ORE can also be selected from fs/nfs/Kconfig.

CC: Michal Marek <mmarek@suse.cz>
CC: Al Viro <viro@ZenIV.linux.org.uk>
Signed-off-by: Boaz Harrosh <bharrosh@panasas.com>
2011-10-24 16:36:33 -07:00
Boaz Harrosh
611d7a5dc6 ore: Make ore_calc_stripe_info EXPORT_SYMBOL
ore_calc_stripe_info is needed by exofs::export.c
for the layout calculations. Make it exportable

Signed-off-by: Boaz Harrosh <bharrosh@panasas.com>
2011-10-24 16:30:08 -07:00
David S. Miller
1805b2f048 Merge branch 'master' of ra.kernel.org:/pub/scm/linux/kernel/git/davem/net 2011-10-24 18:18:09 -04:00
Pavel Shilovsky
32b9aaf1a5 CIFS: Make cifs_push_locks send as many locks at once as possible
This reduces traffic and improves performance.

Signed-off-by: Pavel Shilovsky <piastry@etersoft.ru>
Acked-by: Jeff Layton <jlayton@redhat.com>
Signed-off-by: Steve French <smfrench@gmail.com>
2011-10-24 13:11:55 -05:00
Pavel Shilovsky
9ee305b70e CIFS: Send as many mandatory unlock ranges at once as possible
This reduces traffic and improves performance.

Signed-off-by: Pavel Shilovsky <piastry@etersoft.ru>
Acked-by: Jeff Layton <jlayton@redhat.com>
Signed-off-by: Steve French <smfrench@gmail.com>
2011-10-24 13:11:52 -05:00
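
The idea behind the two CIFS commits above, sending several lock or unlock ranges per request instead of one request per range, can be sketched as follows. This is an illustration only: batch_ranges and max_per_request are hypothetical names, and the commit messages do not say how the real per-request limit is derived.

```python
def batch_ranges(ranges, max_per_request):
    """Group (offset, length) lock ranges into requests of at most max_per_request."""
    if max_per_request < 1:
        raise ValueError("max_per_request must be at least 1")
    return [ranges[i:i + max_per_request]
            for i in range(0, len(ranges), max_per_request)]


# Five ranges travel in three requests instead of five.
ranges = [(0, 10), (20, 5), (40, 8), (64, 1), (100, 16)]
requests = batch_ranges(ranges, max_per_request=2)
```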
Pavel Shilovsky
4f6bcec910 CIFS: Implement caching mechanism for posix brlocks
to handle all lock requests on the client in an exclusive oplock case.

Signed-off-by: Pavel Shilovsky <piastry@etersoft.ru>
Acked-by: Jeff Layton <jlayton@redhat.com>
Signed-off-by: Steve French <smfrench@gmail.com>
2011-10-24 12:29:27 -05:00
Pavel Shilovsky
85160e03a7 CIFS: Implement caching mechanism for mandatory brlocks
If we have an oplock and negotiate mandatory locking style we handle
all brlock requests on the client.

Signed-off-by: Pavel Shilovsky <piastry@etersoft.ru>
Acked-by: Jeff Layton <jlayton@redhat.com>
Signed-off-by: Steve French <smfrench@gmail.com>
2011-10-24 12:27:01 -05:00
Aneesh Kumar K.V
348b59012e net/9p: Convert net/9p protocol dumps to tracepoints
This gives more control over debugging.
root@qemu-img-64:~# ls /pass/123
ls: cannot access /pass/123: No such file or directory
root@qemu-img-64:~# cat /sys/kernel/debug/tracing/trace
# tracer: nop
#
#           TASK-PID    CPU#    TIMESTAMP  FUNCTION
#              | |       |          |         |
              ls-1536  [001]    70.928584: 9p_protocol_dump: clnt 18446612132784021504 P9_TWALK(tag = 1)
000: 16 00 00 00 6e 01 00 01 00 00 00 02 00 00 00 01
010: 00 03 00 31 32 33 00 00 00 ff ff ff ff 00 00 00

              ls-1536  [001]    70.928587: <stack trace>
 => trace_9p_protocol_dump
 => p9pdu_finalize
 => p9_client_rpc
 => p9_client_walk
 => v9fs_vfs_lookup
 => d_alloc_and_lookup
 => walk_component
 => path_lookupat
              ls-1536  [000]    70.929696: 9p_protocol_dump: clnt 18446612132784021504 P9_RLERROR(tag = 1)
000: 0b 00 00 00 07 01 00 02 00 00 00 4e 03 00 02 00
010: 00 00 00 00 03 00 02 00 00 00 00 00 ff 43 00 00

              ls-1536  [000]    70.929697: <stack trace>
 => trace_9p_protocol_dump
 => p9_client_rpc
 => p9_client_walk
 => v9fs_vfs_lookup
 => d_alloc_and_lookup
 => walk_component
 => path_lookupat
 => do_path_lookup

Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
Signed-off-by: Eric Van Hensbergen <ericvh@gmail.com>
2011-10-24 11:13:12 -05:00
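
The hex dumps in the trace output above can be read by hand. A minimal Python sketch follows, assuming the standard 9P framing (size[4] type[1] tag[2], little-endian, with length-prefixed strings); the function names are hypothetical and the byte strings are copied from the two dumps, cut off at the size each message declares.

```python
import struct

P9_TWALK = 110    # 9P message type numbers for the two messages in the dump
P9_RLERROR = 7


def parse_header(pdu):
    """Decode the common 9P header: size[4] type[1] tag[2], little-endian."""
    size, mtype, tag = struct.unpack_from("<IBH", pdu, 0)
    return size, mtype, tag


def parse_twalk(pdu):
    """Decode a Twalk body: fid[4] newfid[4] nwname[2], then length-prefixed names."""
    fid, newfid, nwname = struct.unpack_from("<IIH", pdu, 7)
    offset, names = 17, []
    for _ in range(nwname):
        (length,) = struct.unpack_from("<H", pdu, offset)
        offset += 2
        names.append(pdu[offset:offset + length].decode("utf-8"))
        offset += length
    return fid, newfid, names


# The first 22 bytes of the P9_TWALK dump and the first 11 of the P9_RLERROR dump.
twalk = bytes.fromhex("160000006e0100" "01000000" "02000000" "0100" "0300" "313233")
rlerror = bytes.fromhex("0b000000070100" "02000000")

walk_header = parse_header(twalk)
walk_body = parse_twalk(twalk)
err_header = parse_header(rlerror)
(ecode,) = struct.unpack_from("<I", rlerror, 7)
```

Decoded this way, the request walks from fid 1 to new fid 2 with the single name "123", and the reply carries error code 2, which is ENOENT on Linux and matches the "No such file or directory" message from ls.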
Aneesh Kumar K.V
4d5077f1b2 fs/9p: Cleanup option parsing in 9p
Instead of requiring that all options with integer arguments be listed at the
beginning, move the integer parsing into each option type.

Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
Signed-off-by: Eric Van Hensbergen <ericvh@gmail.com>
2011-10-24 11:13:12 -05:00
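
A sketch of the approach described in "fs/9p: Cleanup option parsing in 9p" above: each option converts its own argument, so integer and non-integer options can appear in any order. The option table and helper names are illustrative assumptions, not the v9fs parser itself.

```python
def parse_int(name, value):
    """Convert an option argument to int, reporting which option was malformed."""
    try:
        return int(value, 0)
    except (TypeError, ValueError):
        raise ValueError(f"integer expected for option {name!r}") from None


# Each option owns its parser; the set of options here is only an example.
PARSERS = {
    "debug": lambda v: parse_int("debug", v),
    "afid": lambda v: parse_int("afid", v),
    "cache": lambda v: v,
    "nodevmap": lambda v: True,
}


def parse_options(text):
    """Parse a comma-separated option string into a dict."""
    opts = {}
    for item in filter(None, text.split(",")):
        name, _, value = item.partition("=")
        if name not in PARSERS:
            raise ValueError(f"unknown option {name!r}")
        opts[name] = PARSERS[name](value if value else None)
    return opts


opts = parse_options("cache=loose,debug=0x4,nodevmap,afid=7")
```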
Aneesh Kumar K.V
464f5ecf00 fs/9p: inode file operation is properly initialized init_special_inode
Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
Signed-off-by: Eric Van Hensbergen <ericvh@gmail.com>
2011-10-24 11:13:11 -05:00
Aneesh Kumar K.V
abfa034e4b fs/9p: Update zero-copy implementation in 9p
* remove a lot of updates to different data structures
* add a separate callback for zero copy requests
* the above makes the non zero copy code path simpler
* remove conditionalizing TREAD/TREADDIR/TWRITE in the zero copy path
* Fix the dotu p9_check_errors with zero copy. Add sufficient doc around
* Add support for both in and output buffers in zero copy callback
* pin and unpin pages in the same context
* use helpers instead of defining page offset and rest of page ourselves
* Fix mem leak in p9_check_errors
* Remove 'E' and 'F' in p9pdu_vwritef

Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
Signed-off-by: Eric Van Hensbergen <ericvh@gmail.com>
2011-10-24 11:13:11 -05:00
Tao Ma
9562ad9ab3 block: Remove the control of complete cpu from bio.
bio originally had the ability to set the completion cpu, but
it is broken.

Christoph said that "This code is unused, and from the all the
discussions lately pretty obviously broken.  The only thing keeping
it serves is creating more confusion and possibly more bugs."

And Jens replied with "We can kill bio_set_completion_cpu(). I'm fine
with leaving cpu control to the request based drivers, they are the
only ones that can toggle the setting anyway".

So this patch tries to remove all the work of controlling the completion cpu
from a bio.

Cc: Shaohua Li <shaohua.li@intel.com>
Cc: Christoph Hellwig <hch@infradead.org>
Signed-off-by: Tao Ma <boyu.mt@taobao.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2011-10-24 16:11:30 +02:00
David Sterba
dff51cd1c6 btrfs: ratelimit WARN_ON in use_block_rsv
Under some circumstances the WARN_ON heavily pollutes the log and slows down
the machine. This is just a safety measure, as the warning should be fixed by
another patch; nevertheless, it still pops up during testing.

Signed-off-by: David Sterba <dsterba@suse.cz>
2011-10-24 14:48:00 +02:00
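
The rate limiting mentioned in "btrfs: ratelimit WARN_ON in use_block_rsv" above can be modelled in a few lines. This is a generic sketch of the technique (at most burst messages per interval), not the kernel's implementation; the class name, its fields, and the explicit now argument are assumptions made so the example runs deterministically.

```python
class RateLimit:
    """Allow at most `burst` events per `interval` seconds."""

    def __init__(self, interval, burst):
        self.interval = interval
        self.burst = burst
        self.window_start = None
        self.printed = 0
        self.missed = 0

    def allow(self, now):
        # Start a new window once the current one has expired.
        if self.window_start is None or now - self.window_start >= self.interval:
            self.window_start = now
            self.printed = 0
        if self.printed < self.burst:
            self.printed += 1
            return True
        self.missed += 1      # suppressed warning
        return False


limiter = RateLimit(interval=5.0, burst=2)
decisions = [limiter.allow(t) for t in (0.0, 1.0, 2.0, 3.0, 6.0)]
```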
David Sterba
a81d3b1ba2 Merge branch 'hotfixes-20111024/josef/for-chris' into btrfs-next-stable 2011-10-24 14:47:58 +02:00
David Sterba
afd582ac8f Merge remote-tracking branch 'remotes/josef/for-chris' into btrfs-next-stable 2011-10-24 14:47:57 +02:00
David Sterba
f9d9ef62cd btrfs: do not allow mounting non-subvolumes via subvol option
There's a missing test whether the path passed to the subvol=path option
during mount is a real subvolume, allowing any directory located in the
default subvolume to be passed and accepted for mount.

(current btrfs progs prevent this early)
$ btrfs subvol snapshot . p1-snap
ERROR: '.' is not a subvolume

(with "is subvolume?" test bypassed)
$ btrfs subvol snapshot . p1-snap
Create a snapshot of '.' in './p1-snap'

$ btrfs subvol list -p .
ID 258 parent 5 top level 5 path subvol
ID 259 parent 5 top level 5 path subvol1
ID 260 parent 5 top level 5 path default-subvol1
ID 262 parent 5 top level 5 path p1/p1-snapshot
ID 263 parent 259 top level 5 path subvol1/subvol1-snap

The problem I see is that this gives a false impression of snapshotting the
given subvolume, when in fact it snapshots the default one: a user expects an
outcome like ID 263 but in fact gets ID 262.

This patch makes mount fail with EINVAL with a message in syslog.

Signed-off-by: David Sterba <dsterba@suse.cz>
2011-10-24 14:43:25 +02:00
Mi Jinlong
345c284290 nfs41: implement DESTROY_CLIENTID operation
According to rfc5661 18.50, implement DESTROY_CLIENTID operation.

Signed-off-by: Mi Jinlong <mijinlong@cn.fujitsu.com>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
2011-10-24 04:24:30 -04:00
Benny Halevy
92bac8c5d6 nfsd4: typo logical vs bitwise negate for want_mask
Signed-off-by: Benny Halevy <bhalevy@tonian.com>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
2011-10-24 04:24:29 -04:00
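
The commit above has only a subject line and does not show the affected expression, so the following illustrates the general class of bug rather than the nfsd4 fix itself. The sketch models C's logical and bitwise negation on a 32-bit value; WANT_MASK and flags are made-up values.

```python
MASK32 = 0xFFFFFFFF


def logical_not(x):
    """Model C's !x: 1 when x is zero, otherwise 0."""
    return 0 if x else 1


def bitwise_not(x):
    """Model C's ~x on a 32-bit unsigned value."""
    return ~x & MASK32


WANT_MASK = 0x0000FF00    # hypothetical mask value, for illustration only
flags = 0x00000303        # two low bits plus two bits inside the mask

cleared_wrong = flags & logical_not(WANT_MASK)   # !mask is 0, so everything is lost
cleared_right = flags & bitwise_not(WANT_MASK)   # only the mask bits are cleared
```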
Benny Halevy
c668fc6dfc nfsd4: allow NFS4_SHARE_SIGNAL_DELEG_WHEN_RESRC_AVAIL | NFS4_SHARE_PUSH_DELEG_WHEN_UNCONTENDED
RFC5661 says:
   The client may set one or both of
   OPEN4_SHARE_ACCESS_WANT_SIGNAL_DELEG_WHEN_RESRC_AVAIL and
   OPEN4_SHARE_ACCESS_WANT_PUSH_DELEG_WHEN_UNCONTENDED.

Signed-off-by: Benny Halevy <bhalevy@tonian.com>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
2011-10-24 04:24:28 -04:00
Benny Halevy
fc0c3dd13b nfsd4: seq->status_flags may be used unitialized
Reported-by: Gopala Suryanarayana <gsuryanarayana@vmware.com>
Signed-off-by: Benny Halevy <bhalevy@tonian.com>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
2011-10-24 04:24:28 -04:00
Benny Halevy
5423732a71 nfsd41: use SEQ4_STATUS_BACKCHANNEL_FAULT when cb_sequence is invalid
Signed-off-by: Benny Halevy <bhalevy@panasas.com>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
2011-10-24 04:24:27 -04:00
Pavel Shilovsky
42274bb22a CIFS: Fix DFS handling in cifs_get_file_info
We should call cifs_all_info_to_fattr in the rc == 0 case only.

Cc: <stable@kernel.org>
Signed-off-by: Pavel Shilovsky <piastry@etersoft.ru>
Reviewed-by: Jeff Layton <jlayton@redhat.com>
Signed-off-by: Steve French <smfrench@gmail.com>
2011-10-22 12:29:35 -05:00
Dmitry Monakhov
1939dd84b3 ext4: cleanup ext4_ext_grow_indepth code
Currently the code gives the impression that the grow procedure is very
complicated and that some mythical paths and blocks are involved. But in fact
growing in depth is a relatively simple procedure:
 1) Just create a new meta block and copy the root data to that block.
 2) Convert the root from extent to index if old depth == 0
 3) Update the root block pointer

This patch does the following:
 - Reorganize the code to make it more self-explanatory
 - Do not pass the path parameter to new_meta_block() in order to
   provoke allocation from the inode's group, because the top-level block
   should sit closer to its inode, not to the leaf data block.

   [ This happens anyway, due to logic in mballoc; we should drop
     the path parameter from new_meta_block() entirely.  -- tytso ]

Signed-off-by: Dmitry Monakhov <dmonakhov@openvz.org>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2011-10-22 01:26:05 -04:00
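
The three steps listed in "ext4: cleanup ext4_ext_grow_indepth code" above can be shown on a toy in-memory tree. This is a simplified model under stated assumptions, not ext4's on-disk format: Node and grow_in_depth are hypothetical names, and a depth of 0 is taken to mean that the entries are extents.

```python
class Node:
    def __init__(self, depth, entries):
        self.depth = depth        # 0 means the entries are extents (a leaf)
        self.entries = entries    # extents at depth 0, child nodes otherwise


def grow_in_depth(root):
    """Add one level to the tree while keeping the same root object."""
    # 1) Create a new block and copy the root's data into it.
    new_block = Node(root.depth, list(root.entries))
    # 2) The root now holds index entries rather than extents, and
    # 3) its single entry points at the new block.
    root.entries = [new_block]
    root.depth += 1
    return root


root = Node(0, [(0, 8), (8, 4)])    # two (start, length) extents in the root
grow_in_depth(root)
```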
Pavel Shilovsky
a2d6b6cacb CIFS: Fix error handling in cifs_readv_complete
In cifs_readv_receive we don't update rdata->result to an error value
after kmap'ing a page. We should kunmap the page in the no-error
case only.

Signed-off-by: Pavel Shilovsky <piastry@etersoft.ru>
Reviewed-by: Jeff Layton <jlayton@redhat.com>
Signed-off-by: Steve French <smfrench@gmail.com>
2011-10-21 09:21:04 -05:00
Steven Whitehouse
b99b98dc26 GFS2: Move readahead of metadata during deallocation into its own function
Move the recently added readahead of the indirect pointer
tree during deallocation into its own function in order
that we can use it elsewhere in the future. Also this
fixes the resetting of the "first" variable in the
original patch.

Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
2011-10-21 12:39:54 +01:00
Steven Whitehouse
9ae32429fe GFS2: Remove two unused variables
The two variables being initialised in gfs2_inplace_reserve
to track the file & line number of the caller are never
used, so we might as well remove them.

If something does go wrong, then a stack trace is probably
more useful anyway.

Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
2011-10-21 12:39:52 +01:00
Steven Whitehouse
891a8e9335 GFS2: Misc fixes
Some items picked up through automated code analysis. A few bits
of unreachable code and two unchecked return values.

Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
2011-10-21 12:39:51 +01:00
Benjamin Marzinski
64dd153c83 GFS2: rewrite fallocate code to write blocks directly
GFS2's fallocate code currently goes through the page cache. Since it's only
writing to the end of the file or to holes in it, it doesn't need to, and it
was causing issues on low memory environments. This patch pulls in some of
Steve's block allocation work, and uses it to simply allocate the blocks for
the file, and zero them out at allocation time.  It provides a slight
performance increase, and it dramatically simplifies the code.

Signed-off-by: Benjamin Marzinski <bmarzins@redhat.com>
Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
2011-10-21 12:39:49 +01:00
Bob Peterson
bd5437a7d4 GFS2: speed up delete/unlink performance for large files
This patch improves the performance of delete/unlink
operations in a GFS2 file system where the files are large
by adding a layer of metadata read-ahead for indirect blocks.
Mileage will vary, but on my system, deleting an 8.6G file
dropped from 22 seconds to about 4.5 seconds.

Signed-off-by: Bob Peterson <rpeterso@redhat.com>
Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
2011-10-21 12:39:47 +01:00
Steven Whitehouse
f75bbfb4dd GFS2: Fix off-by-one in gfs2_blk2rgrpd
Bob reported:

I found an off-by-one problem with how I coded this section:
It should be:

+ else if (blk >= cur->rd_data0 + cur->rd_data)

In fact, cur->rd_data0 + cur->rd_data is the start of the next
rgrp (the next ri_addr), so without the "=" check it can land on
the wrong rgrp.

In all normal cases, this won't be a problem: you're searching
for a block _within_ the rgrp, which will pass the test properly.
Where it gets into trouble is if you search the rgrps for the
block exactly equal to ri_addr.  I don't think anything in the
kernel does this, but I found a place in gfs2-utils gfs2_edit
where it does.  So I definitely need to fix it in libgfs2.  I'd
like to suggest we fix it in the kernel as well for the sake of
keeping the functions similar.

So this patch fixes the above mentioned off by one error as well
as removing the unused parent pointer.

Reported-by: Bob Peterson <rpeterso@redhat.com>
Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
2011-10-21 12:39:46 +01:00
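
The boundary condition Bob describes above can be reproduced with a small model. The kernel walks an rbtree; the sketch below uses a binary search over a sorted list instead, which makes the same comparisons. find_rgrp and inclusive_end_check are hypothetical names, and each tuple stands for (rd_data0, rd_data).

```python
def find_rgrp(rgrps, blk, inclusive_end_check=True):
    """Binary-search sorted, non-overlapping (data0, length) ranges for blk.

    Returns the index of the range the search stops on, or None.
    With inclusive_end_check=False the upper test uses '>' instead of '>=',
    which reproduces the off-by-one.
    """
    lo, hi = 0, len(rgrps) - 1
    while lo <= hi:
        mid = (lo + hi) // 2
        data0, length = rgrps[mid]
        end = data0 + length            # first block of the next range
        if inclusive_end_check:
            past_end = blk >= end
        else:
            past_end = blk > end
        if blk < data0:
            hi = mid - 1
        elif past_end:
            lo = mid + 1
        else:
            return mid
    return None


rgrps = [(0, 100), (100, 100)]
fixed = find_rgrp(rgrps, 100)                              # second range
buggy = find_rgrp(rgrps, 100, inclusive_end_check=False)   # stops on the first
```

A block strictly inside a range is found correctly either way; only a block exactly equal to the start of the next range exposes the difference.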
Steven Whitehouse
13d921e371 GFS2: Clean up ->page_mkwrite
This patch brings gfs2's ->page_mkwrite up to date with respect to the
expectations set by the VM. Also added is a check to wait if the fs
is frozen, before we attempt to get a glock. This will only work on
the node which initiates the freeze, but that's OK since the transaction
lock will still provide the expected barrier on other nodes.

The major change here is that we return a locked page now, except when
we don't return a page at all (error cases). This removes the race
which required rechecking the page after it was returned.

Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
Cc: Nick Piggin <npiggin@kernel.dk>
2011-10-21 12:39:44 +01:00
Steven Whitehouse
ccad4e147a GFS2: Correctly set goal block after allocation
The new goal block should be set to the end of the newly
allocated extent, not the start of it.

Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
2011-10-21 12:39:42 +01:00
Steven Whitehouse
b5b24d7aeb GFS2: Fix AIL flush issue during fsync
Unfortunately, it is not enough to just ignore locked buffers during
the AIL flush from fsync. We need to be able to ignore all buffers
which are locked, dirty or pinned at this stage as they might have
been added subsequent to the log flush earlier in the fsync function.

In addition, this means that we no longer need to rely on i_mutex to
keep out writes during fsync, so we can, as a side-effect, remove
that protection too.

Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
Tested-By: Abhijith Das <adas@redhat.com>
2011-10-21 12:39:41 +01:00
Steven Whitehouse
70b0c3656f GFS2: Use cached rgrp in gfs2_rlist_add()
Each block which is deallocated requires a call to gfs2_rlist_add(),
and each of those calls was calling gfs2_blk2rgrpd() in order to
figure out which rgrp the block belonged in. This can be sped up
by making use of the rgrp cached in the inode. We also reset this
cached rgrp in case the block has changed rgrp. This should provide
a big reduction in gfs2_blk2rgrpd() calls during deallocation.

Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
2011-10-21 12:39:39 +01:00
Steven Whitehouse
d56fa8a1c1 GFS2: Call do_strip() directly from recursive_scan()
The recursive_scan() function only ever takes a single "bc"
argument, so we might as well just call do_strip() directly
from recursive_scan() rather than pass it in as an argument.

Also the "data" argument is always a struct strip_mine, so
we can pass that in, rather than using a void pointer.

This also moves do_strip() ahead of recursive_scan() so that
we don't need to add a prototype.

Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
2011-10-21 12:39:38 +01:00
Steven Whitehouse
534029e2fd GFS2: Remove obsolete assert
Given that a resource group has been locked, there is no reason why
we should not be able to allocate as many blocks as are free. The
al_requested parameter should really be considered as a minimum
number of blocks to be available. Should this limit be overshot,
there are other mechanisms which will prevent over allocation.

Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
2011-10-21 12:39:36 +01:00
Steven Whitehouse
54335b1fca GFS2: Cache the most recently used resource group in the inode
This means that after the initial allocation for any inode, the
last used resource group is cached in the inode for future use.
This drastically reduces the number of lookups of resource
groups in the common case, and thus the contention on that
data structure.

The allocation algorithm is the same as previously, except that we
always check to see if the goal block is within the cached rgrp
first before going to the rbtree to look one up.

Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
2011-10-21 12:39:34 +01:00
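
The lookup order described in "GFS2: Cache the most recently used resource group in the inode" above (check the cached rgrp first, fall back to the tree only on a miss) can be sketched like this. It is a model under assumptions: a sorted list with bisect stands in for the rbtree, and RgrpIndex, Inode and rgrp_for_goal are hypothetical names.

```python
import bisect


class RgrpIndex:
    """Sorted (start, length) ranges with a counter of full lookups."""

    def __init__(self, ranges):
        self.ranges = sorted(ranges)
        self.starts = [start for start, _ in self.ranges]
        self.lookups = 0

    def lookup(self, blk):
        self.lookups += 1
        i = bisect.bisect_right(self.starts, blk) - 1
        if i >= 0 and blk < self.ranges[i][0] + self.ranges[i][1]:
            return self.ranges[i]
        return None


class Inode:
    def __init__(self):
        self.cached_rgrp = None


def rgrp_for_goal(inode, index, goal):
    """Return the range containing goal, consulting the inode's cache first."""
    cached = inode.cached_rgrp
    if cached is not None and cached[0] <= goal < cached[0] + cached[1]:
        return cached
    inode.cached_rgrp = index.lookup(goal)
    return inode.cached_rgrp


index = RgrpIndex([(0, 100), (100, 100)])
inode = Inode()
hits = [rgrp_for_goal(inode, index, goal) for goal in (5, 6, 99)]
lookups_after_hits = index.lookups
moved = rgrp_for_goal(inode, index, 150)
```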
Steven Whitehouse
8339ee543e GFS2: Make resource groups "append only" during life of fs
Since we have ruled out supporting online filesystem shrink,
it is possible to make the resource group list append only
during the life of a super block. This gives several benefits:

Firstly, we only need to read new rindex elements as they are added
rather than needing to reread the whole rindex file each time one
element is added.

Secondly, the rindex glock can be held for much shorter periods of
time, and is completely removed from the fast path for allocations.
The lock is taken in shared mode only when updating the resource
groups when the first allocation occurs, and after a grow has
taken place.

Thirdly, this results in a reduction in code size, and everything
gets a lot simpler to understand in this area.

Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
2011-10-21 12:39:33 +01:00
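
The first benefit listed in the "append only" commit above, reading only the new rindex elements, follows directly from the list never shrinking or being reordered. A minimal sketch, with a plain Python list standing in for the rindex file and read_new_entries as a hypothetical helper:

```python
def read_new_entries(rindex_file, known):
    """Append entries beyond those already known; existing ones are never re-read."""
    new = rindex_file[len(known):]
    known.extend(new)
    return len(new)


rindex_file = ["rg0", "rg1"]
known = []
first = read_new_entries(rindex_file, known)    # initial read picks up both entries
rindex_file.append("rg2")                       # the filesystem is grown
second = read_new_entries(rindex_file, known)   # only the new entry is read
```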
Bob Peterson
7c9ca62113 GFS2: Use rbtree for resource groups and clean up bitmap buffer ref count scheme
Here is an update of Bob's original rbtree patch which, in addition, also
resolves the rather strange ref counting that was being done relating to
the bitmap blocks.

Originally we had a dual system for journaling resource groups. The metadata
blocks were journaled and also the rgrp itself was added to a list. The reason
for adding the rgrp to the list in the journal was so that the "repolish
clones" code could be run to update the free space, and potentially send any
discard requests when the log was flushed. This was done by comparing the
"cloned" bitmap with what had been written back on disk during the transaction
commit.

Due to this, there was a requirement to hang on to the rgrps' bitmap buffers
until the journal had been flushed. For that reason, there was a rather
complicated set up in the ->go_lock ->go_unlock functions for rgrps involving
both a mutex and a spinlock (the ->sd_rindex_spin) to maintain a reference
count on the buffers.

However, the journal maintains a reference count on the buffers anyway, since
they are being journaled as metadata buffers. So by moving the code which deals
with the post-journal accounting for bitmap blocks to the metadata journaling
code, we can entirely dispense with the rather strange buffer ref counting
scheme and also the requirement to journal the rgrps.

The net result of all this is that the ->sd_rindex_spin is left to do exactly
one job, and that is to look after the rbtree of rgrps.

This patch is designed to be a stepping stone towards using RCU for the rbtree
of resource groups, however the reduction in the number of uses of the
->sd_rindex_spin is likely to have benefits for multi-threaded workloads,
anyway.

The patch retains ->go_lock and ->go_unlock for rgrps; however, these may also
be removed in future in favour of calling the functions directly where required
in the code. That will allow locking of resource groups without needing to
actually read them in - something that could be useful in speeding up statfs.

In the meantime, though, it is valid to dereference ->bi_bh only when the rgrp
is locked. This is basically the same rule as before, modulo the references not
being valid until the following journal flush.

Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
Signed-off-by: Bob Peterson <rpeterso@redhat.com>
Cc: Benjamin Marzinski <bmarzins@redhat.com>
2011-10-21 12:39:31 +01:00
Steven Whitehouse
9453615a1a GFS2: Fix lseek after SEEK_DATA, SEEK_HOLE have been added
We need to take the inode's glock whenever the inode's size
is referenced, otherwise it might not be up to date. Even
though generic_file_llseek_unlocked() doesn't implement
SEEK_DATA, SEEK_HOLE directly, it does reference the inode's
size in those cases, so we need to add them to the list
of origins which need the glock.

Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
Cc: Andi Kleen <ak@linux.intel.com>
2011-10-21 12:39:29 +01:00