When we get any IO error during a recovery (rebuilding a spare), we abort
the recovery and restart it.
For RAID6 (and multi-drive RAID1) it may not be best to restart at the
beginning: when multiple failures can be tolerated, the recovery may be
able to continue and re-doing all that has already been done doesn't make
sense.
We already have the infrastructure to record where a recovery is up to
and restart from there, but it is not being used properly.
This is because:
- We sometimes abort with MD_RECOVERY_ERR rather than just MD_RECOVERY_INTR,
which causes the recovery not be be checkpointed.
- We remove spares and then re-added them which loses important state
information.
The distinction between MD_RECOVERY_ERR and MD_RECOVERY_INTR really isn't
needed. If there is an error, the relevant drive will be marked as
Faulty, and that is enough to ensure correct handling of the error. So we
first remove MD_RECOVERY_ERR, changing some of the uses of it to
MD_RECOVERY_INTR.
Then we cause the attempt to remove a non-faulty device from an array to
fail (unless recovery is impossible as the array is too degraded). Then
when remove_and_add_spares attempts to remove the devices on which
recovery can continue, it will fail, they will remain in place, and
recovery will continue on them as desired.
Issue: If we are halfway through rebuilding a spare and another drive
fails, and a new spare is immediately available, do we want to:
1/ complete the current rebuild, then go back and rebuild the new spare or
2/ restart the rebuild from the start and rebuild both devices in
parallel.
Both options can be argued for. The code currently takes option 2 as
a/ this requires least code change
b/ this results in a minimally-degraded array in minimal time.
Cc: "Eivind Sarto" <ivan@kasenna.com>
Signed-off-by: Neil Brown <neilb@suse.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
In some configurations, a raid6 resync can be limited by CPU speed
(Calculating P and Q and moving data) rather than by device speed. In
these cases there is nothing to be gained byt serialising resync of arrays
that share a device, and doing the resync in parallel can provide benefit.
So add a sysfs tunable to flag an array as being allowed to resync in
parallel with other arrays that use (a different part of) the same device.
Signed-off-by: Bernd Schubert <bs@q-leap.de>
Signed-off-by: Neil Brown <neilb@suse.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
This additional notification to 'array_state' is needed to allow the
monitor application to learn about stop events via sysfs. The
sysfs_notify("sync_action") call that comes at the end of do_md_stop()
(via md_new_event) is insufficient since the 'sync_action' attribute has
been removed by this point.
(Seems like a sysfs-notify-on-removal patch is a better fix. Currently
removal updates the event count but does not wake up waiters)
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
Signed-off-by: Neil Brown <neilb@suse.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
When an array enters write pending, 'array_state' changes, so we must be
sure to sysfs_notify.
Also, when waiting for user-space to acknowledge 'write-pending' by
marking the metadata as dirty, we don't want to wait for MD_CHANGE_DEVS to
be cleared as that might not happen. So explicity test for the bits that
we are really interested in.
Signed-off-by: Neil Brown <neilb@suse.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
When performing a "recovery" or "check" pass on a RAID1 array, we read
from each device and possible, if there is a difference or a read error,
write back to some devices.
We use the same 'bio' for both read and write, resetting various fields
between the two operations.
We forgot to reset bv_offset and bv_len however. These are often left
unchanged, but in the case where there is an IO error one or two sectors
into a page, they are changed.
This results in correctable errors not being corrected properly. It does
not result in any data corruption.
Cc: "Fairbanks, David" <David.Fairbanks@stratus.com>
Signed-off-by: Neil Brown <neilb@suse.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Last night we had scsi problems and a hardware raid unit was offlined
during heavy i/o. While this happened we got for about 3 minutes a huge
number messages like these
Apr 12 03:36:07 pfs1n14 kernel: [197510.696595] raid5:md7: read error not correctable (sector 2993096568 on sdj2).
I guess the high error rate is responsible for not scheduling other events
- during this time the system was not pingable and in the end also other
devices run into scsi command timeouts causing problems on these unrelated
devices as well.
Signed-off-by: Bernd Schubert <bernd-schubert@gmx.de>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
Signed-off-by: Neil Brown <neilb@suse.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Kill the trivial and rather pointless file_path wrapper around d_path.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Neil Brown <neilb@suse.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
It is possible to add a write-intent bitmap to an active array, or remove
the bitmap that is there.
When we do with the 'quiesce' the array, which causes make_request to
block in "wait_barrier()".
However we are sampling the value of "mddev->bitmap" before the
wait_barrier call, and using it afterwards. This can result in using a
bitmap structure that has been freed.
Signed-off-by: Neil Brown <neilb@suse.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Add support for the InstaShield IS-400 four port RS-232 PCI card.
Signed-off-by: Ignacio García Pérez <iggarpe@t2i.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
This driver reads IBM Active Energy Manager energy/temperature/power
sensors on IBM System X hardware.
[akpm@linux-foundation.org: fix printk warnings]
Signed-off-by: Darrick J. Wong <djwong@us.ibm.com>
Cc: "Mark M. Hoffman" <mhoffman@lightlink.com>
Cc: Corey Minyard <minyard@acm.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Minor rework to support the Intel 5400 chipset.
Signed-off-by: Darrick J. Wong <djwong@us.ibm.com>
Cc: "Mark M. Hoffman" <mhoffman@lightlink.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/roland/infiniband:
IB/mad: Fix kernel crash when .process_mad() returns SUCCESS|CONSUMED
IPoIB: Test for NULL broadcast object in ipiob_mcast_join_finish()
MAINTAINERS: Add cxgb3 and iw_cxgb3 NIC and iWARP driver entries
IB/mlx4: Fix creation of kernel QP with max number of send s/g entries
IB/mthca: Fix max_sge value returned by query_device
RDMA/cxgb3: Fix uninitialized variable warning in iwch_post_send()
IB/mlx4: Fix uninitialized-var warning in mlx4_ib_post_send()
IB/ipath: Fix UC receive completion opcode for RDMA WRITE with immediate
IB/ipath: Fix printk format for ipath_sdma_status
If a low-level driver returns IB_MAD_RESULT_SUCCESS | IB_MAD_RESULT_CONSUMED,
handle_outgoing_dr_smp() doesn't clean up properly. The fix is to
kfree the local data and break, rather than falling through. This was
observed with the ipath driver, but could happen with any driver.
This fixes <https://bugs.openfabrics.org/show_bug.cgi?id=1027>.
Signed-off-by: Dave Olson <dave.olson@qlogic.com>
Signed-off-by: Roland Dreier <rolandd@cisco.com>
Not sure how this snuck upstream, but it really doesn't belong there. We
don't need a KERN_ERR printk in the suspend path to know what's going on (at
least not anymore).
Signed-off-by: Jesse Barnes <jbarnes@virtuousgeek.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* 'merge' of git://git.kernel.org/pub/scm/linux/kernel/git/paulus/powerpc:
[POWERPC] iSeries: Remove unused mail address
[POWERPC] mpic: Fix use of uninitialized variable
[POWERPC] Add kernstart_addr to list of allowed symbols in prom_init
[POWERPC] Fix __set_fixmap() for STRICT_MM_TYPECHECKS
[POWERPC] PS3: Fix memory hotplug
drivers/video/aty/atyfb_base.c:3359:26: warning: Using plain integer as NULL pointer
drivers/video/aty/radeon_base.c:2280:32: warning: Using plain integer as NULL pointer
drivers/video/matrox/matroxfb_base.h:203:25: warning: Using plain integer as NULL pointer
drivers/video/matrox/matroxfb_base.h:203:25: warning: Using plain integer as NULL pointer
drivers/video/sis/sis_main.c:5790:44: warning: Using plain integer as NULL pointer
Signed-off-by: Harvey Harrison <harvey.harrison@gmail.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
drivers/scsi/aha152x.c:3585:60: warning: Using plain integer as NULL pointer
drivers/scsi/aha152x.c:3845:56: warning: Using plain integer as NULL pointer
drivers/scsi/qla1280.c:2814:37: warning: Using plain integer as NULL pointer
drivers/scsi/atp870u.c:750:47: warning: Using plain integer as NULL pointer
drivers/scsi/3w-9xxx.c:1281:36: warning: Using plain integer as NULL pointer
drivers/scsi/3w-9xxx.c:1293:36: warning: Using plain integer as NULL pointer
drivers/scsi/3w-9xxx.c:1301:35: warning: Using plain integer as NULL pointer
drivers/scsi/hptiop.c:447:10: warning: Using plain integer as NULL pointer
drivers/scsi/hptiop.c:457:10: warning: Using plain integer as NULL pointer
drivers/scsi/hptiop.c:479:24: warning: Using plain integer as NULL pointer
drivers/scsi/hptiop.c:483:22: warning: Using plain integer as NULL pointer
drivers/scsi/hptiop.c:1213:23: warning: Using plain integer as NULL pointer
drivers/scsi/hptiop.c:1214:23: warning: Using plain integer as NULL pointer
Signed-off-by: Harvey Harrison <harvey.harrison@gmail.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
drivers/isdn/hysdn/hycapi.c:465:42: warning: Using plain integer as NULL pointer
drivers/isdn/hysdn/hycapi.c:467:44: warning: Using plain integer as NULL pointer
drivers/isdn/hysdn/hycapi.c:469:42: warning: Using plain integer as NULL pointer
Signed-off-by: Harvey Harrison <harvey.harrison@gmail.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
drivers/acpi/dispatcher/dsmethod.c:568:50: warning: Using plain integer as NULL pointer
drivers/acpi/executer/exmutex.c:329:30: warning: Using plain integer as NULL pointer
drivers/acpi/executer/exmutex.c:466:31: warning: Using plain integer as NULL pointer
Signed-off-by: Harvey Harrison <harvey.harrison@gmail.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
I don't use my IBM email address normally and people can find me in
CREDITS.
Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au>
Signed-off-by: Paul Mackerras <paulus@samba.org>
Use memmove to handle overlapping copy of data.
Signed-off-by: Stephen Hemminger <shemminger@vyatta.com>
Signed-off-by: Jeff Garzik <jgarzik@redhat.com>
Use netdev_alloc_skb for rx buffer allocation. This sets skb->dev
and can be overriden for NUMA machines.
Change code to return new buffer rather than call by reference.
Signed-off-by: Stephen Hemminger <shemminger@vyatta.com>
Signed-off-by: Jeff Garzik <jgarzik@redhat.com>
Use netdev_alloc_skb. This sets skb->dev and allows arch specific
allocation.
Signed-off-by: Stephen Hemminger <shemminger@vyatta.com>
Signed-off-by: Jeff Garzik <jgarzik@redhat.com>
Use netdev_alloc_skb. This sets skb->dev and allows arch specific
allocation.
Remove dead code and dead comments.
Signed-off-by: Stephen Hemminger <shemminger@vyatta.com>
Signed-off-by: Jeff Garzik <jgarzik@redhat.com>
Use netdev_alloc_skb. This sets skb->dev and allows arch specific
allocation.
Signed-off-by: Stephen Hemminger <shemminger@vyatta.com>
Signed-off-by: Jeff Garzik <jgarzik@redhat.com>
Use netdev_alloc_skb. This sets skb->dev and allows arch specific
allocation. Also simplify and cleanup the alignment code.
Signed-off-by: Maciej W. Rozycki <macro@linux-mips.org>
Signed-off-by: Jeff Garzik <jgarzik@redhat.com>
Use netdev_alloc_skb for rx buffer allocation. This sets skb->dev
and can be overriden for NUMA machines.
Signed-off-by: Stephen Hemminger <shemminger@vyatta.com>
Signed-off-by: Jeff Garzik <jgarzik@redhat.com>
Remove extraneous semicolons after switch and conditional statements.
Signed-off-by: Stephen Hemminger <shemminger@vyatta.com>
Signed-off-by: Jeff Garzik <jgarzik@redhat.com>
fmvj18x_cs: The manfid of "NextCom NC5310 rev B" is MANF_ID_FUJITSU.
but this card is MBH10302 based card.
use ConfigBase to detect the cardtype for this card.
Signed-off-by: Komuro <komurojun-mbn@nifty.com>
Signed-off-by: Jeff Garzik <jgarzik@redhat.com>
If alloc_skb failed, the recieved packet will be dropped. Do not increase
rx_bytes for dropped packet.
Signed-off-by: Wang Chen <wangchen@cn.fujitsu.com>
Signed-off-by: Jeff Garzik <jgarzik@redhat.com>
Commit cac1f3c8 factored out the code for get_phy_id so that it
could be reused in multiple places. Turns out that some of the
users can be modular, so we need to export this symbol as well.
Signed-off-by: Paul Gortmaker <paul.gortmaker@windriver.com>
Signed-off-by: Jeff Garzik <jgarzik@redhat.com>
Signed-off-by: Philipp Zabel <philipp.zabel@gmail.com>
Acked-by: Eric Miao <eric.miao@marvell.com>
Signed-off-by: Russell King <rmk+kernel@arm.linux.org.uk>
3c515.c uses pnp_irq(), which calls pnp_get_resource(),
which is not defined when CONFIG_PNP=n, so in that case,
get the IRQ from a hardware register.
3c515.c:(.text+0x3adc0): undefined reference to `pnp_get_resource'
Signed-off-by: Randy Dunlap <randy.dunlap@oracle.com>
Signed-off-by: Jeff Garzik <jgarzik@redhat.com>
Add Broadcom PHYs supported missing from the description.
Signed-off-by: Maciej W. Rozycki <macro@linux-mips.org>
Signed-off-by: Jeff Garzik <jgarzik@redhat.com>
Configure the BCM5482S secondary SerDes for 1000Base-X mode when the
appropriate dev_flags are passed in to phy_connect(). This is
needed when the PHY is used for fiber and backplane connections.
Signed-off-by: Nate Case <ncase@xes-inc.com>
Signed-off-by: Maciej W. Rozycki <macro@linux-mips.org>
Signed-off-by: Jeff Garzik <jgarzik@redhat.com>
Add a "follow" selection for fail_over_mac. This option
causes the MAC address to move from slave to slave as the active
slave changes. This is in addition to the existing fail_over_mac option
that causes the bond's MAC address to change during failover.
This new option is useful for devices that cannot tolerate
multiple ports using the same MAC address simultaneously, either
because it confuses them or incurs a performance penalty (as is the
case with some LPAR-aware multiport devices). Because the MAC of the
bond itself does not change, the "follow" option is slightly more
reliable during failover and doesn't change the MAC of the bond during
operation.
This patch requires a previous ARP monitor change to properly
handle RTNL during failovers.
Signed-off-by: Jay Vosburgh <fubar@us.ibm.com>
Signed-off-by: Jeff Garzik <jgarzik@redhat.com>
Refactor ARP monitor for active-backup mode. The motivation for
this is to take care of locking issues in a clear manner (particularly to
correctly handle RTNL vs. the bonding locks). Currently, the a-b ARP
monitor does not hold RTNL at all, but future changes will require RTNL
during ARP monitor failovers.
Rather than using conditional locking, this patch instead breaks
up the ARP monitor into three discrete steps: inspection, commit changes,
and probe. The inspection phase marks slaves that require link state
changes. The commit phase is only called if inspection detects that
changes are needed, and is called with RTNL. Lastly, the probe phase
issues the ARP probes that the inspection phase uses to determine link
state.
Signed-off-by: Jay Vosburgh <fubar@us.ibm.com>
Signed-off-by: Jeff Garzik <jgarzik@redhat.com>
With IPoIB, reception of gratuitous ARP by neighboring hosts
is essential for a successful change of slaves in case of failure.
Otherwise, they won't learn about the HW address change and need
to wait a long time until the neighboring system gives up and sends
an ARP request to learn the new HW address. This patch decreases
the chance for a lost of a gratuitous ARP packet by sending it more
than once. The number retries is configurable and can be set with a
module param.
Signed-off-by: Moni Shoua <monis@voltaire.com>
Acked-by: Jay Vosburgh <fubar@us.ibm.com>
Signed-off-by: Jeff Garzik <jgarzik@redhat.com>
Some places iterate over the checked list right after the check
itself, so even if the list is empty, the list_for_each_xxx
iterator will make everything right by himself.
Signed-off-by: Pavel Emelyanov <xemul@openvz.org>
Acked-by: Jay Vosburgh <fubar@us.ibm.com>
Signed-off-by: Jeff Garzik <jgarzik@redhat.com>
Many places either do not modify the list under the list_for_each_xxx,
or break out of the loop as soon as the first element is removed.
Thus, this _safe iteration just occupies some unneeded .text space
and requires an additional variable.
Signed-off-by: Pavel Emelyanov <xemul@openvz.org>
Acked-by: Jay Vosburgh <fubar@us.ibm.com>
Signed-off-by: Jeff Garzik <jgarzik@redhat.com>
While we're fixing the bond_create, I hope it's OK to polish it
a bit after the fixes.
The third argument is NULL at the first caller and is ignored by
the second one, so remove it.
Signed-off-by: Pavel Emelyanov <xemul@openvz.org>
Acked-by: Jay Vosburgh <fubar@us.ibm.com>
Signed-off-by: Jeff Garzik <jgarzik@redhat.com>
Remove bond_has_ip and all references to it. With this change,
the ARP monitor will always send ARP probes if the master is up and has
at least one slave. If the bond has an IP address, it is used in the
ARP probe; if not, the probes are sent with all zeros in the sender's
IP address (which is consistent with an RFC 2131 4.4.1 duplicate address
probe).
This is useful for cases when bonding itself is hidden underneath
a layer of virtual devices, e.g., with Xen.
Change suggested by Tsutomu Fujii <t-fujii@nb.jp.nec.com>, who
included a one-line patch that only affected active-backup mode.
Signed-off-by: Jay Vosburgh <fubar@us.ibm.com>
Signed-off-by: Jeff Garzik <jgarzik@redhat.com>
Convert bonding to use msecs_to_jiffies instead of doing the
math. For the ARP monitor, there was an underflow problem that could
result in an infinite loop. The miimon already had that worked around,
but this is cleaner.
Originally by Nicolas de Pesloüan <nicolas.2p.debian@free.fr>
Jay Vosburgh corrected a math error in the original; Nicolas' original
commit message is:
When setting arp_interval parameter to a very low value, delta_in_ticks
for next arp might become 0, causing an infinite loop.
See http://bugzilla.kernel.org/show_bug.cgi?id=10680
Same problem for miimon parameter already fixed, but fix might be
enhanced, by using msecs_to_jiffies() function.
Signed-off-by: Nicolas de Pesloüan <nicolas.2p.debian@free.fr>
Signed-off-by: Jay Vosburgh <fubar@us.ibm.com>
Signed-off-by: Jeff Garzik <jgarzik@redhat.com>