This patch fixes the problem that system may stall if target's ->map_rq
returns DM_MAPIO_REQUEUE in map_request().
E.g. stall happens on 1 CPU box when a dm-mpath device with queue_if_no_path
bounces between all-paths-down and paths-up on I/O load.
When target's ->map_rq returns DM_MAPIO_REQUEUE, map_request() requeues
the request and returns to dm_request_fn(). Then, dm_request_fn()
doesn't exit the I/O dispatching loop and continues processing
the requeued request again.
This map and requeue loop can be done with interrupt disabled,
so 1 CPU system can be stalled if this situation happens.
For example, commands below can stall my 1 CPU box within 1 minute or so:
# dmsetup table mp
mp: 0 2097152 multipath 1 queue_if_no_path 0 1 1 service-time 0 1 2 8:144 1 1
# while true; do dd if=/dev/mapper/mp of=/dev/null bs=1M count=100; done &
# while true; do \
> dmsetup message mp 0 "fail_path 8:144" \
> dmsetup suspend --noflush mp \
> dmsetup resume mp \
> dmsetup message mp 0 "reinstate_path 8:144" \
> done
To fix the problem above, this patch changes dm_request_fn() to exit
the I/O dispatching loop once if a request is requeued in map_request().
Signed-off-by: Kiyoshi Ueda <k-ueda@ct.jp.nec.com>
Signed-off-by: Jun'ichi Nomura <j-nomura@ce.jp.nec.com>
Cc: stable@kernel.org
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
When suspending a failed mirror, bios are completed by mirror_end_io() and
__rh_lookup() in dm_rh_dec() returns NULL where a non-NULL return value is
required by design. Fix this by not changing the state of the recovery failed
region from DM_RH_RECOVERING to DM_RH_NOSYNC in dm_rh_recovery_end().
Issue
On 2.6.33-rc1 kernel, I hit the bug when I suspended the failed
mirror by dmsetup command.
BUG: unable to handle kernel NULL pointer dereference at 00000020
IP: [<f94f38e2>] dm_rh_dec+0x35/0xa1 [dm_region_hash]
...
EIP: 0060:[<f94f38e2>] EFLAGS: 00010046 CPU: 0
EIP is at dm_rh_dec+0x35/0xa1 [dm_region_hash]
EAX: 00000286 EBX: 00000000 ECX: 00000286 EDX: 00000000
ESI: eff79eac EDI: eff79e80 EBP: f6915cd4 ESP: f6915cc4
DS: 007b ES: 007b FS: 00d8 GS: 0033 SS: 0068
Process dmsetup (pid: 2849, ti=f6914000 task=eff03e80 task.ti=f6914000)
...
Call Trace:
[<f9530af6>] ? mirror_end_io+0x53/0x1b1 [dm_mirror]
[<f9413104>] ? clone_endio+0x4d/0xa2 [dm_mod]
[<f9530aa3>] ? mirror_end_io+0x0/0x1b1 [dm_mirror]
[<f94130b7>] ? clone_endio+0x0/0xa2 [dm_mod]
[<c02d6bcb>] ? bio_endio+0x28/0x2b
[<f952f303>] ? hold_bio+0x2d/0x62 [dm_mirror]
[<f952f942>] ? mirror_presuspend+0xeb/0xf7 [dm_mirror]
[<c02aa3e2>] ? vmap_page_range+0xb/0xd
[<f9414c8d>] ? suspend_targets+0x2d/0x3b [dm_mod]
[<f9414ca9>] ? dm_table_presuspend_targets+0xe/0x10 [dm_mod]
[<f941456f>] ? dm_suspend+0x4d/0x150 [dm_mod]
[<f941767d>] ? dev_suspend+0x55/0x18a [dm_mod]
[<c0343762>] ? _copy_from_user+0x42/0x56
[<f9417fb0>] ? dm_ctl_ioctl+0x22c/0x281 [dm_mod]
[<f9417628>] ? dev_suspend+0x0/0x18a [dm_mod]
[<f9417d84>] ? dm_ctl_ioctl+0x0/0x281 [dm_mod]
[<c02c3c4b>] ? vfs_ioctl+0x22/0x85
[<c02c422c>] ? do_vfs_ioctl+0x4cb/0x516
[<c02c42b7>] ? sys_ioctl+0x40/0x5a
[<c0202858>] ? sysenter_do_call+0x12/0x28
Analysis
When recovery process of a region failed, dm_rh_recovery_end() function
changes the state of the region from RM_RH_RECOVERING to DM_RH_NOSYNC.
When recovery_complete() is executed between dm_rh_update_states() and
dm_writes() in do_mirror(), bios are processed with the region state,
DM_RH_NOSYNC. However, the region data is freed without checking its
pending count when dm_rh_update_states() is called next time.
When bios are finished by mirror_end_io(), __rh_lookup() in dm_rh_dec()
returns NULL even though a valid return value are expected.
Solution
Remove the state change of the recovery failed region from DM_RH_RECOVERING
to DM_RH_NOSYNC in dm_rh_recovery_end(). We can remove the state change
because:
- If the region data has been released by dm_rh_update_states(),
a new region data is created with the state of DM_RH_NOSYNC, and
bios are processed according to the DM_RH_NOSYNC state.
- If the region data has not been released by dm_rh_update_states(),
a state of the region is DM_RH_RECOVERING and bios are put in the
delayed_bio list.
The flag change from DM_RH_RECOVERING to DM_RH_NOSYNC in dm_rh_recovery_end()
was added in the following commit:
dm raid1: handle resync failures
author Jonathan Brassow <jbrassow@redhat.com>
Thu, 12 Jul 2007 16:29:04 +0000 (17:29 +0100)
http://git.kernel.org/linus/f44db678edcc6f4c2779ac43f63f0b9dfa28b724
Signed-off-by: Takahiro Yasui <tyasui@redhat.com>
Reviewed-by: Mikulas Patocka <mpatocka@redhat.com>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
If the mirror log fails when the handle_errors option was not selected
and there is no remaining valid mirror leg, writes return success even
though they weren't actually written to any device. This patch
completes them with EIO instead.
This code path is taken:
do_writes:
bio_list_merge(&ms->failures, &sync);
do_failures:
if (!get_valid_mirror(ms)) (false)
else if (errors_handled(ms)) (false)
else bio_endio(bio, 0);
The logic in do_failures is based on presuming that the write was already
tried: if it succeeded at least on one leg (without handle_errors) it
is reported as success.
Reference: https://bugzilla.redhat.com/show_bug.cgi?id=555197
Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
This patch fixes two bugs that revolve around the miscalculation and
misuse of the variable 'overhead_size'. 'overhead_size' is the size of
the various header structures used during communication.
The first bug is the use of 'sizeof' with the pointer of a structure
instead of the structure itself - resulting in the wrong size being
computed. This is then used in a check to see if the payload
(data_size) would be to large for the preallocated structure. Since the
bug produces a smaller value for the overhead, it was possible for the
structure to be breached. (Although the current users of the code do
not currently send enough data to trigger this bug.)
The second bug is that the 'overhead_size' value is used to compute how
much of the preallocated space should be cleared before populating it
with fresh data. This should have simply been 'sizeof(struct cn_msg)'
not overhead_size. The fact that 'overhead_size' was computed
incorrectly made this problem "less bad" - leaving only a pointer's
worth of space at the end uncleared. Thus, this bug was never producing
a bad result, but still needs to be fixed - especially now that the
value is computed correctly.
Cc: stable@kernel.org
Signed-off-by: Jonathan Brassow <jbrassow@redhat.com
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
chunk_io() declares its 'struct mdata_req' on the stack and then
initializes its 'struct work_struct' member. Annotate the
initialization of this workqueue with INIT_WORK_ON_STACK to suppress a
debugobjects warning seen when CONFIG_DEBUG_OBJECTS_WORK is enabled.
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
If a table containing zero as stripe count is passed into stripe_ctr
the code attempts to divide by zero.
This patch changes DM_TABLE_LOAD to return -EINVAL if the stripe count
is zero.
We now get the following error messages:
device-mapper: table: 253:0: striped: Invalid stripe count
device-mapper: ioctl: error adding target to table
Signed-off-by: Nikanth Karthikesan <knikanth@suse.de>
Cc: stable@kernel.org
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
We need to have the WUS register set to all 1's in order for the hardware
to be capable of ever waking up. Set it here in the ixgbe_probe().
Signed-off-by: Don Skidmore <donald.c.skidmore@intel.com>
Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
The 82598 has an erratum that receipt of pause frames at 1G
could lead to a Tx Hang. To avoid this this patch disables
Rx FC while at 1G speed for all 82598 parts.
Signed-off-by: Don Skidmore <donald.c.skidmore@intel.com>
Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
These changes add MDIO and phy lib support to the driver as the
IP core now supports the MDIO bus.
The MDIO bus and phy are added as a child to the emaclite in the device
tree as illustrated below.
mdio {
#address-cells = <1>;
#size-cells = <0>;
phy0: phy@7 {
compatible = "marvell,88e1111";
reg = <7>;
} ;
}
Signed-off-by: Sadanand Mutyala <Sadanand.Mutyala@xilinx.com>
Signed-off-by: John Linn <john.linn@xilinx.com>
Acked-by: Grant Likely <grant.likely@secretlab.ca>
Signed-off-by: David S. Miller <davem@davemloft.net>
The code has been tested on IBM pSeries server.
Signed-off-by: Sathya Perla <sathyap@serverengines.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
The RCU usage in the original code was broken because
there are cases where we possibly sleep with rcu_read_lock
held. As a fix, change the macvtap_file_get_queue to
get a reference on the socket and the netdev instead of
taking the full rcu_read_lock.
Also, change macvtap_file_get_queue failure case to
not require a subsequent macvtap_file_put_queue, as
pointed out by Ed Swierk.
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Cc: Ed Swierk <eswierk@aristanetworks.com>
Cc: Sridhar Samudrala <sri@us.ibm.com>
Acked-by: Sridhar Samudrala <sri@us.ibm.com>
Acked-by: Ed Swierk <eswierk@aristanetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
The driver is expected to report that the link is up
when the phy Rx signal is established and the mac
has not detected a link fault.
The code is however broken, the driver does not check the link fault
status when the phy link status changes.
The link fault status being checked within a short period of time,
it leads to link up/link down events.
Signed-off-by: Divy Le Ray <divy@chelsio.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
The mac is expected to auto-inflate the Maximum Frame size for VLAN
tagged frames. It however does not work with jumbo frames.
Work around the bug adding 4 to the Maximum Frame for MTUs
greater than 1536.
Signed-off-by: Divy Le Ray <divy@chelsio.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
The recent n-tuple patches added some comments to the headers
of the Flow Director functions that aren't accurate. This
cleans them up, and is a purely cosmetic patch.
Signed-off-by: Peter P Waskiewicz Jr <peter.p.waskiewicz.jr@intel.com>
Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/ieee1394/linux1394-2.6:
firewire: ohci: retransmit isochronous transmit packets on cycle loss
firewire: net: fix panic in fwnet_write_complete
Fix unconditional empty kerne log message every interrupt.
Kill some informational log messages that are superfluous
and anyways occur before the netdev is registered.
Signed-off-by: David S. Miller <davem@davemloft.net>
vhost-net only uses memory barriers to control SMP effects
(communication with userspace potentially running on a different CPU),
so it should use SMP barriers and not mandatory barriers for memory
access ordering, as suggested by Documentation/memory-barriers.txt
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
Acked-by: Rusty Russell <rusty@rustcorp.com.au>
Signed-off-by: David S. Miller <davem@davemloft.net>
Remove #define PFX
Add pr_fmt(fmt) KBUILD_MODNAME ": " fmt
Convert printks to pr_<level>
Convert printks without levels to pr_cont
Convert pr_<level> with np->dev to netdev_<level>
Convert dev_<level> to netdev_<level>
Convert niudbg to netif_printk
Convert niuinfo, niuwarn macros to netif_<level>(priv, type, dev...
Coalesce long formats
Convert embedded function names to "%s", __func__
Always use "%s()..." when __func__ is printed
Signed-off-by: Joe Perches <joe@perches.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
In testing I've never seen it go past 1 retry anyways but better
safe than sorry.
Reported by Droste on irc.
Signed-off-by: Dave Airlie <airlied@redhat.com>
Noticed on a DEC Alpha.
Start up into console mode caused 15 unaligned accesses, and starting X
caused another 48.
Signed-off-by: Matt Turner <mattst88@gmail.com>
CC: Jerome Glisse <jglisse@redhat.com>
CC: Alex Deucher <alexdeucher@gmail.com>
Signed-off-by: Dave Airlie <airlied@redhat.com>
If the buffer object was already in the requested memory type, but
outside of the requested range it was never moved into the requested range.
Signed-off-by: Thomas Hellstrom <thellstrom@vmware.com>
Signed-off-by: Dave Airlie <airlied@redhat.com>
When searching for free space in a range, the function could return a node extending outside of the given range.
Signed-off-by: Thomas Hellstrom <thellstrom@vmware.com>
Signed-off-by: Dave Airlie <airlied@redhat.com>
In isochronous transmit DMA descriptors, link the skip address pointer
back to the descriptor itself. When a cycle is lost, the controller
will send the packet in the next cycle, instead of terminating the
entire DMA program.
There are two reasons for this:
* This behaviour is compatible with the old IEEE1394 stack. Old
applications would not expect the DMA program to stop in this case.
* Since the OHCI driver does not report any uncompleted packets, the
context would stop silently; clients would not have any chance to
detect and handle this error without a watchdog timer.
Signed-off-by: Clemens Ladisch <clemens@ladisch.de>
Pieter Palmers notes:
"The reason I added this retry behavior to the old stack is because some
cards now and then fail to send a packet (e.g. the o2micro card in my
dell laptop). I couldn't figure out why exactly this happens, my best
guess is that the card cannot fetch the payload data on time. This
happens much more frequently when sending large packets, which leads me
to suspect that there are some contention issues with the DMA that fills
the transmit FIFO.
In the old stack it was a pretty critical issue as it resulted in a
freeze of the userspace application.
The omission of a packet doesn't necessarily have to be an issue. E.g.
in IEC61883 streams the DBC field can be used to detect discontinuities
in the stream. So as long as the other side doesn't bail when no
[packet] is present in a cycle, there is not really a problem.
I'm not convinced though that retrying is the proper solution, but it is
simple and effective for what it had to do. And I think there are no
reasons not to do it this way. Userspace can still detect this by
checking the cycle the descriptor was sent in."
Signed-off-by: Stefan Richter <stefanr@s5r6.in-berlin.de> (changelog, comment)
The AC131 does not enable the forced transmit clock settings
immediately. The workaround is to read the register again to get the
setting to take effect.
Signed-off-by: Matt Carlson <mcarlson@broadcom.com>
Reviewed-by: Michael Chan <mchan@broadcom.com>
Reviewed-by: Benjamin Li <benli@broadcom.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
The 57765 lacks TSS support. This renders the napi assignments
incorrect in the loopback test function. This patch fixes the problem.
Signed-off-by: Matt Carlson <mcarlson@broadcom.com>
Reviewed-by: Michael Chan <mchan@broadcom.com>
Reviewed-by: Benjamin Li <benli@broadcom.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
The driver puts the phy into low-power mode when it releases the device.
If the device were to be reacquired, the phy needs a reset to bring it
back to full powered operation. This patch allows phylib-enabled
devices to reset the phy.
Signed-off-by: Matt Carlson <mcarlson@broadcom.com>
Reviewed-by: Michael Chan <mchan@broadcom.com>
Reviewed-by: Benjamin Li <benli@broadcom.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
The 5717's DMA read engine has a bug when initiating multiple DMA reads
across the PCIe bus. This patch disables the feature.
Signed-off-by: Matt Carlson <mcarlson@broadcom.com>
Reviewed-by: Michael Chan <mchan@broadcom.com>
Reviewed-by: Benjamin Li <benli@broadcom.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
On A0 revision of 57765 asic rev devices, the bootcode will perform some
hardware operations, after the magic signature is presented, that will
collide with setup operations performed by the driver. The best way to
avoid the contention is to have the driver delay an additional 10
milliseconds. B0 revisions of the chip will make this workaround
unnecessary.
Signed-off-by: Matt Carlson <mcarlson@broadcom.com>
Signed-off-by: Benjamin Li <benli@broadcom.com>
Reviewed-by: Michael Chan <mchan@broadcom.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
The previous patch changed the code so that new rx buffer submissions to
the hardware stall if a new submission would overwrite data needed by an
unserviced rx packet. On very busy 5717 and 57765 asic rev devices,
there is a corner case where the hardware will fail to assert an MSI-X
interrupt for rx traffic. If that vector's interrupt never has another
reason to assert, any rx buffers held will never be serviced. If the
buffers are never serviced and the hardware consumes all the available
rx packets for other rx rings, deadlock will result.
The most reliable and least intrusive way to work around the problem is
to detect the case where new submissions would overwrite existing data
and force all rx interrupt vectors to fire.
Signed-off-by: Matt Carlson <mcarlson@broadcom.com>
Reviewed-by: Michael Chan <mchan@broadcom.com>
Reviewed-by: Benjamin Li <benli@broadcom.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
When operating in RSS mode, it is possible for one rx return ring to
submit enough rx buffers back to the hardware such that it inadvertently
overwrites data needed by another rx return ring. This patch addresses
the problem by looking for non-NULL skb pointers in the
rx_[std|jmb]_buffers rings that parallel the rx producer rings.
Signed-off-by: Matt Carlson <mcarlson@broadcom.com>
Signed-off-by: Michael Chan <mchan@broadcom.com>
Reviewed-by: Benjamin Li <benli@broadcom.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
RSS ring 1 is responsible for submitting new rx buffers to the
hardware on behalf of all the other RSS rx return rings. Up until now
this ring submitted its new rx buffers to the producer ring directly.
The following patch will require that this ring have a place to put
backlogged rx packets. As a consequence, it can no longer submit new
buffers to the producer ring.
This patch adds code to allocate an extra shadow producer ring for this
RSS ring and adds RSS ring 1 to the list of rings needing buffer
transfers.
Signed-off-by: Matt Carlson <mcarlson@broadcom.com>
Signed-off-by: Michael Chan <mchan@broadcom.com>
Reviewed-by: Benjamin Li <benli@broadcom.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Add support to the igb driver for VF configuration mechanisms through the
PF interface.
Signed-off-by: Mitch Williams <mitch.a.williams@intel.com>
Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Add and export pci_num_vf to allow other subsystems to determine how many
virtual function devices are associated with an SR-IOV physical function
device.
Add macros dev_is_pci, dev_is_ps, and dev_num_vf to make it easier for
non-PCI specific code to determine SR-IOV capabilities.
Signed-off-by: Mitch Williams <mitch.a.williams@intel.com>
Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Rewrite sky2_reset to work with interrupts disabled and
avoid freeing and reallocing memory.
The old code used sky2_down and sky2_up to implement sky2_reset,
which meant interrupts could not be disabled, and the transmit and
receive ring buffers would be free'd and reallocated.
To avoid the interrupt handler waking the transmit queue while
we're doing a reset, it's better to have interrupts and NAPI
polls disabled.
Note: Modified Mike's patch to do IRQ disable in sky2_down before
calling sky2_hw_down - Stephen
Signed-off-by: Mike McCormack <mikem@ring3k.org>
Acked-by: Stephen Hemminger <shemminger@vyatta.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Create a sky2_hw_down that brings the hardware down.
Signed-off-by: Mike McCormack <mikem@ring3k.org>
Acked-by: Stephen Hemminber <shemminger@vyatta.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Move hardware initialization into sky2_hw_up.
Signed-off-by: Mike McCormack <mikem@ring3k.org>
Acked-by: Stephen Hemminger <shemminger@vyatta.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Allocate everything in one place so there's a single point
of failure in sky2_up, and sky2_rx_start can no longer fail.
Don't leave the hardware in a partially initialized state in the
case rx ring allocation fails.
As with the old code, the rx ring still needs to be fully
allocated for sky2_up to succeed.
Signed-off-by: Mike McCormack <mikem@ring3k.org>
Acked-by: Stephen Hemminger <shemminger@vyatta.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Move code to calculate receive threshold and packet size out of
sky2_rx_start() so that is can be called from elsewhere easily.
Signed-off-by: Mike McCormack <mikem@ring3k.org>
Acked-by: Stephen Hemminger <shemminger@vyatta.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Change how FIFO is programmed in jumbo mode (to match vendor driver).
Mostly cosmetic, the only register change is that the bits 22,23
are not programemd used.
Signed-off-by: Stephen Hemminger <shemminger@vyatta.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
This bit is not changed by vendor driver, and should be left alone.
The documentation implies this a debug bit.
0 = WAKE# only asserted when VMAIN not available
1 = WAKE# is depend on wake events and independent of VMAIN.
Signed-off-by: Stephen Hemminger <shemminger@vyatta.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Change Wake On Lan code to be similar to vendor driver. The definition
of Y2_HW_WOL_ON is confusing; what it means is transition to firmware SPI
setting when doing power change.
Since same code is done for both shutdown and suspend, use common
code path.
Signed-off-by: Stephen Hemminger <shemminger@vyatta.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Convert QPRINTK macros to netif_<level> equivalents.
Expands and standardizes the logging message output.
Removes __func__ from most logging messages.
Signed-off-by: Joe Perches <joe@perches.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Convert uses of msg_<type> to netif_<level>(
Remove msg_<type> macros
Signed-off-by: Joe Perches <joe@perches.com>
Signed-off-by: David S. Miller <davem@davemloft.net>