-----BEGIN PGP SIGNATURE-----
iQIzBAABCAAdFiEEZH8oZUiU471FcZm+ONu9yGCSaT4FAlt/64YACgkQONu9yGCS
aT4/pBAAynguZbVbn8QtYi37Kam0R4ZvXe7rKN8H1A7rwo9l9NJDaC4z2M0Iutfj
1CfIEOeaf7WtxL25xgvDHQEOfB3/DH0xHbP/DtwqzpT0PmOUqMPaboGqOqXb//1J
WetcluEOQfoYu1DWofPP1YkAR3vU4Bp40ucAGIN2wE4bvMTR2EMbV8Y5QYgIk6mN
6n+Smg2Xpkq6paOhIrEt3C1P3lXlpY5Hxd54TGTRQ5c+vccXNldczIcP2Z0wue9/
LRI8veyY1q/IDhVR8wCrlNb3df6kUQ0xixfTNnTkUJjLs3j+NAsaJiO8/nrdCUhJ
xQORM7gQIMlccSNanKH0MHoCxhT3iMb8S6Hixvai5O+5XjP03TA7aAZ9Cyp7UqHg
JY5SPbh7YOmvRXbx7/NAgyLYwRcJRt2PamNRApLQKFbot4bSvNJquhrAib5t6kCF
HfbXjr9N969gLR4WmGkyOi0IHt8kaVwQitfBLZdj2QdlvyYWXmj0MuJ/I4BuZqtj
0MyzS/v8cxkN/NWO1p1cB7pRzFtaXtHtC6rxzYXKCUycnHW9cJDf5PBgCfDMqyTY
SdyuCeMrUo4mNEDItrKF8nbswew1T4UsayvJ6UgKHKr3QaH3Xp1mzeyt1GU38tn1
ogKm9cVbOuAhnic67ikISFsj8oNptrq0w+Zqe3AKGO8B7CwXwis=
=Q/T6
-----END PGP SIGNATURE-----
Merge 4.4.152 into android-4.4
Changes in 4.4.152
ARC: Explicitly add -mmedium-calls to CFLAGS
netfilter: ipv6: nf_defrag: reduce struct net memory waste
selftests: pstore: return Kselftest Skip code for skipped tests
selftests: static_keys: return Kselftest Skip code for skipped tests
selftests: user: return Kselftest Skip code for skipped tests
selftests: zram: return Kselftest Skip code for skipped tests
selftests: sync: add config fragment for testing sync framework
ARM: dts: Cygnus: Fix I2C controller interrupt type
usb: dwc2: fix isoc split in transfer with no data
usb: gadget: composite: fix delayed_status race condition when set_interface
usb: gadget: dwc2: fix memory leak in gadget_init()
scsi: xen-scsifront: add error handling for xenbus_printf
arm64: make secondary_start_kernel() notrace
qed: Add sanity check for SIMD fastpath handler.
enic: initialize enic->rfs_h.lock in enic_probe
net: hamradio: use eth_broadcast_addr
net: propagate dev_get_valid_name return code
ARC: Enable machine_desc->init_per_cpu for !CONFIG_SMP
net: davinci_emac: match the mdio device against its compatible if possible
locking/lockdep: Do not record IRQ state within lockdep code
ipv6: mcast: fix unsolicited report interval after receiving querys
Smack: Mark inode instant in smack_task_to_inode
cxgb4: when disabling dcb set txq dcb priority to 0
brcmfmac: stop watchdog before detach and free everything
ARM: dts: am437x: make edt-ft5x06 a wakeup source
usb: xhci: increase CRS timeout value
perf test session topology: Fix test on s390
perf report powerpc: Fix crash if callchain is empty
selftests/x86/sigreturn/64: Fix spurious failures on AMD CPUs
ARM: dts: da850: Fix interrups property for gpio
dmaengine: k3dma: Off by one in k3_of_dma_simple_xlate()
md/raid10: fix that replacement cannot complete recovery after reassemble
drm/exynos: gsc: Fix support for NV16/61, YUV420/YVU420 and YUV422 modes
drm/exynos: decon5433: Fix per-plane global alpha for XRGB modes
drm/exynos: decon5433: Fix WINCONx reset value
bnx2x: Fix receiving tx-timeout in error or recovery state.
m68k: fix "bad page state" oops on ColdFire boot
HID: wacom: Correct touch maximum XY of 2nd-gen Intuos
ARM: imx_v6_v7_defconfig: Select ULPI support
ARM: imx_v4_v5_defconfig: Select ULPI support
tracing: Use __printf markup to silence compiler
kasan: fix shadow_size calculation error in kasan_module_alloc
smsc75xx: Add workaround for gigabit link up hardware errata.
netfilter: x_tables: set module owner for icmp(6) matches
ARM: pxa: irq: fix handling of ICMR registers in suspend/resume
ieee802154: at86rf230: switch from BUG_ON() to WARN_ON() on problem
ieee802154: at86rf230: use __func__ macro for debug messages
ieee802154: fakelb: switch from BUG_ON() to WARN_ON() on problem
drm/armada: fix colorkey mode property
bnxt_en: Fix for system hang if request_irq fails
perf llvm-utils: Remove bashism from kernel include fetch script
ARM: 8780/1: ftrace: Only set kernel memory back to read-only after boot
ARM: dts: am3517.dtsi: Disable reference to OMAP3 OTG controller
ixgbe: Be more careful when modifying MAC filters
packet: reset network header if packet shorter than ll reserved space
qlogic: check kstrtoul() for errors
tcp: remove DELAYED ACK events in DCTCP
drm/nouveau/gem: off by one bugs in nouveau_gem_pushbuf_reloc_apply()
net/ethernet/freescale/fman: fix cross-build error
net: usb: rtl8150: demote allmulti message to dev_dbg()
net: qca_spi: Avoid packet drop during initial sync
net: qca_spi: Make sure the QCA7000 reset is triggered
net: qca_spi: Fix log level if probe fails
tcp: identify cryptic messages as TCP seq # bugs
staging: android: ion: check for kref overflow
KVM: irqfd: fix race between EPOLLHUP and irq_bypass_register_consumer
ext4: fix spectre gadget in ext4_mb_regular_allocator()
parisc: Remove ordered stores from syscall.S
xfrm_user: prevent leaking 2 bytes of kernel memory
netfilter: conntrack: dccp: treat SYNC/SYNCACK as invalid if no prior state
packet: refine ring v3 block size test to hold one frame
bridge: Propagate vlan add failure to user
parisc: Remove unnecessary barriers from spinlock.h
PCI: hotplug: Don't leak pci_slot on registration failure
PCI: Skip MPS logic for Virtual Functions (VFs)
PCI: pciehp: Fix use-after-free on unplug
i2c: imx: Fix race condition in dma read
reiserfs: fix broken xattr handling (heap corruption, bad retval)
Linux 4.4.152
Change-Id: I1058813031709d20abd0bc45e9ac5fc68ab3a1d7
Signed-off-by: Greg Kroah-Hartman <gregkh@google.com>
[ Upstream commit a69258f7aa2623e0930212f09c586fd06674ad79 ]
After fixing the way DCTCP tracking delayed ACKs, the delayed-ACK
related callbacks are no longer needed
Signed-off-by: Yuchung Cheng <ycheng@google.com>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Acked-by: Neal Cardwell <ncardwell@google.com>
Acked-by: Lawrence Brakmo <brakmo@fb.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Sasha Levin <alexander.levin@microsoft.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
-----BEGIN PGP SIGNATURE-----
iQIzBAABCAAdFiEEZH8oZUiU471FcZm+ONu9yGCSaT4FAltoWioACgkQONu9yGCS
aT6YrQ//d8dWKaNZK08Z/l2ZqRS56wlNTJyHIB81p1uM2PuPHfLjsZzLQ+HnZ3Ha
G+fedEj3sbwJp8i61TRu9Q1p/PyLWsnaryWZaK3gm4Yo8GrdVbXAY47EHwz3fbUK
yxrC0+zQmIlyZgwzbUNGspDuAdNt2MFDug97RFF8BdhJd84Rv0BbicGMwKJQFfFN
g0Tv6yB+8cjmnCMjmLreLyi+puWvXZtZXAi+idl9eTC4ysGDKNvO1ERptv2NC5C6
171cbsS/ngpY5ZIUcmLy0QPPFh/ZCeoft22R3gOxZDkjT4Ro6lY5ubPKDEcn57Hv
FSV5fuQ3cBtmsODn7LMIWqLDKuCRM/gTmvXrWxM91JDLSsuAdZWATj8k4iIoocmk
l/3iOixBMFCGToQ1I2/O33QZOssKoDIz4bpG6+HM/Cj4anSnVZKjouJSTlNZr/3i
ZJOXpu/MpQItc/RGo/PumzJLkXhS+HyGwPbTIOPy29NMqp+xvjZv4DttuJbqyHJ2
Pm/OZcvU7z1wSMhcTknvZLLMQVRIICQjfPJNDefqAdrCdd233cRo37cU8egg4A0l
F3q+ZI/ny01YWQP8KrCJyWB5lHHbEc44wUHCxet0TPZ1qaqvVcXzaWhwxP2H0L3I
7r2u9bDg15ielw3jhPpRWZMvANbQlToNoj6YROqj5ArcIowcBPc=
=7/iL
-----END PGP SIGNATURE-----
Merge 4.4.146 into android-4.4
Changes in 4.4.146
MIPS: Fix off-by-one in pci_resource_to_user()
Input: elan_i2c - add ACPI ID for lenovo ideapad 330
Input: i8042 - add Lenovo LaVie Z to the i8042 reset list
Input: elan_i2c - add another ACPI ID for Lenovo Ideapad 330-15AST
tracing: Fix double free of event_trigger_data
tracing: Fix possible double free in event_enable_trigger_func()
tracing/kprobes: Fix trace_probe flags on enable_trace_kprobe() failure
tracing: Quiet gcc warning about maybe unused link variable
xen/netfront: raise max number of slots in xennet_get_responses()
ALSA: emu10k1: add error handling for snd_ctl_add
ALSA: fm801: add error handling for snd_ctl_add
nfsd: fix potential use-after-free in nfsd4_decode_getdeviceinfo
mm: vmalloc: avoid racy handling of debugobjects in vunmap
mm/slub.c: add __printf verification to slab_err()
rtc: ensure rtc_set_alarm fails when alarms are not supported
netfilter: ipset: List timing out entries with "timeout 1" instead of zero
infiniband: fix a possible use-after-free bug
hvc_opal: don't set tb_ticks_per_usec in udbg_init_opal_common()
powerpc/64s: Fix compiler store ordering to SLB shadow area
RDMA/mad: Convert BUG_ONs to error flows
disable loading f2fs module on PAGE_SIZE > 4KB
f2fs: fix to don't trigger writeback during recovery
usbip: usbip_detach: Fix memory, udev context and udev leak
perf/x86/intel/uncore: Correct fixed counter index check in generic code
perf/x86/intel/uncore: Correct fixed counter index check for NHM
iwlwifi: pcie: fix race in Rx buffer allocator
Bluetooth: hci_qca: Fix "Sleep inside atomic section" warning
Bluetooth: btusb: Add a new Realtek 8723DE ID 2ff8:b011
ASoC: dpcm: fix BE dai not hw_free and shutdown
mfd: cros_ec: Fail early if we cannot identify the EC
mwifiex: handle race during mwifiex_usb_disconnect
wlcore: sdio: check for valid platform device data before suspend
media: videobuf2-core: don't call memop 'finish' when queueing
btrfs: add barriers to btrfs_sync_log before log_commit_wait wakeups
btrfs: qgroup: Finish rescan when hit the last leaf of extent tree
PCI: Prevent sysfs disable of device while driver is attached
ath: Add regulatory mapping for FCC3_ETSIC
ath: Add regulatory mapping for ETSI8_WORLD
ath: Add regulatory mapping for APL13_WORLD
ath: Add regulatory mapping for APL2_FCCA
ath: Add regulatory mapping for Uganda
ath: Add regulatory mapping for Tanzania
ath: Add regulatory mapping for Serbia
ath: Add regulatory mapping for Bermuda
ath: Add regulatory mapping for Bahamas
powerpc/32: Add a missing include header
powerpc/chrp/time: Make some functions static, add missing header include
powerpc/powermac: Add missing prototype for note_bootable_part()
powerpc/powermac: Mark variable x as unused
powerpc/8xx: fix invalid register expression in head_8xx.S
pinctrl: at91-pio4: add missing of_node_put
PCI: pciehp: Request control of native hotplug only if supported
mwifiex: correct histogram data with appropriate index
scsi: ufs: fix exception event handling
ALSA: emu10k1: Rate-limit error messages about page errors
regulator: pfuze100: add .is_enable() for pfuze100_swb_regulator_ops
md: fix NULL dereference of mddev->pers in remove_and_add_spares()
media: smiapp: fix timeout checking in smiapp_read_nvm
ALSA: usb-audio: Apply rate limit to warning messages in URB complete callback
HID: hid-plantronics: Re-resend Update to map button for PTT products
drm/radeon: fix mode_valid's return type
powerpc/embedded6xx/hlwd-pic: Prevent interrupts from being handled by Starlet
HID: i2c-hid: check if device is there before really probing
tty: Fix data race in tty_insert_flip_string_fixed_flag
dma-iommu: Fix compilation when !CONFIG_IOMMU_DMA
media: rcar_jpu: Add missing clk_disable_unprepare() on error in jpu_open()
libata: Fix command retry decision
media: saa7164: Fix driver name in debug output
mtd: rawnand: fsl_ifc: fix FSL NAND driver to read all ONFI parameter pages
brcmfmac: Add support for bcm43364 wireless chipset
s390/cpum_sf: Add data entry sizes to sampling trailer entry
perf: fix invalid bit in diagnostic entry
scsi: 3w-9xxx: fix a missing-check bug
scsi: 3w-xxxx: fix a missing-check bug
scsi: megaraid: silence a static checker bug
thermal: exynos: fix setting rising_threshold for Exynos5433
bpf: fix references to free_bpf_prog_info() in comments
media: siano: get rid of __le32/__le16 cast warnings
drm/atomic: Handling the case when setting old crtc for plane
ALSA: hda/ca0132: fix build failure when a local macro is defined
memory: tegra: Do not handle spurious interrupts
memory: tegra: Apply interrupts mask per SoC
drm/gma500: fix psb_intel_lvds_mode_valid()'s return type
ipconfig: Correctly initialise ic_nameservers
rsi: Fix 'invalid vdd' warning in mmc
audit: allow not equal op for audit by executable
microblaze: Fix simpleImage format generation
usb: hub: Don't wait for connect state at resume for powered-off ports
crypto: authencesn - don't leak pointers to authenc keys
crypto: authenc - don't leak pointers to authenc keys
media: omap3isp: fix unbalanced dma_iommu_mapping
scsi: scsi_dh: replace too broad "TP9" string with the exact models
scsi: megaraid_sas: Increase timeout by 1 sec for non-RAID fastpath IOs
media: si470x: fix __be16 annotations
drm: Add DP PSR2 sink enable bit
random: mix rdrand with entropy sent in from userspace
squashfs: be more careful about metadata corruption
ext4: fix inline data updates with checksums enabled
ext4: check for allocation block validity with block group locked
dmaengine: pxa_dma: remove duplicate const qualifier
ASoC: pxa: Fix module autoload for platform drivers
ipv4: remove BUG_ON() from fib_compute_spec_dst
net: fix amd-xgbe flow-control issue
net: lan78xx: fix rx handling before first packet is send
xen-netfront: wait xenbus state change when load module manually
NET: stmmac: align DMA stuff to largest cache line length
tcp: do not force quickack when receiving out-of-order packets
tcp: add max_quickacks param to tcp_incr_quickack and tcp_enter_quickack_mode
tcp: do not aggressively quick ack after ECN events
tcp: refactor tcp_ecn_check_ce to remove sk type cast
tcp: add one more quick ack after after ECN events
inet: frag: enforce memory limits earlier
net: dsa: Do not suspend/resume closed slave_dev
netlink: Fix spectre v1 gadget in netlink_create()
squashfs: more metadata hardening
squashfs: more metadata hardenings
can: ems_usb: Fix memory leak on ems_usb_disconnect()
net: socket: fix potential spectre v1 gadget in socketcall
virtio_balloon: fix another race between migration and ballooning
kvm: x86: vmx: fix vpid leak
crypto: padlock-aes - Fix Nano workaround data corruption
scsi: sg: fix minor memory leak in error path
Linux 4.4.146
Change-Id: Ia7e43a90d0f5603c741811436b8de41884cb2851
Signed-off-by: Greg Kroah-Hartman <gregkh@google.com>
[ Upstream commit 9a9c9b51e54618861420093ae6e9b50a961914c5 ]
We want to add finer control of the number of ACK packets sent after
ECN events.
This patch is not changing current behavior, it only enables following
change.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Acked-by: Soheil Hassas Yeganeh <soheil@google.com>
Acked-by: Neal Cardwell <ncardwell@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
-----BEGIN PGP SIGNATURE-----
iQIzBAABCAAdFiEEZH8oZUiU471FcZm+ONu9yGCSaT4FAltcAuIACgkQONu9yGCS
aT632hAAgtUvSZJZTh6nMePKNys+R3XhqSbJQRqgHsWP6e8gPJ/4A5G7VmoT0STX
0QG1+J4WscOa+0E2XznYqhEJGisS32skS8VIxfWW1ISPcx4p2tgMrSJdfjxEWiKA
/7x39msawlcshITTjoRZjV60WzHM2MQgWa24ifOXrxxM+VlLcVSUehyMyYWfrZEt
hJQtz6iZp3eUvbKopJnCu7iyTFo9RJciSRUmWmYg3CDROn4HJAUgV/NdgDvHmt5J
+11WAvjQ3RdBSWy7jDadJDqy1BP2r3VdmAS1clxmjCUMsCPeHtOqNlEjc+6FhYoj
93BNcqKpqPsN2lhuHWCHcZCWLuKA2DW+Rp3l6SvfSpxd55oQeIQEnsLnyCl9XAge
YhGJZfSd/Ug/fvqlHyqKiv3J3ykCDnq6T4uzyxxmoeFgVq4RvMxSl0u9vMO9CG5u
jq0Xc19ytvUUNe0ZHXSRPbgUJCEBfWIppgoXuTL4SI/E4hmyhDqUXiSiH+Hjfufc
tnuTnSSz1CxXHct07sU/kbOTYiVHZmu/eG2Nbx+pG+d48i3/OzdW5EQ5UYvorAb3
sOkZm5Au5VP/HTJoeW7SLGeZRI0b1SxECJOg5ENmchb8sWV9MjilUoUa408rPin9
OYYQ7OKA3FHIvxlUCgw6RT6AUZQrwRRY7iAqnR46u26I5Ejif4k=
=r3bN
-----END PGP SIGNATURE-----
Merge 4.4.145 into android-4.4
Changes in 4.4.145
MIPS: ath79: fix register address in ath79_ddr_wb_flush()
ip: hash fragments consistently
net/mlx4_core: Save the qpn from the input modifier in RST2INIT wrapper
rtnetlink: add rtnl_link_state check in rtnl_configure_link
tcp: fix dctcp delayed ACK schedule
tcp: helpers to send special DCTCP ack
tcp: do not cancel delay-AcK on DCTCP special ACK
tcp: do not delay ACK in DCTCP upon CE status change
tcp: avoid collapses in tcp_prune_queue() if possible
tcp: detect malicious patterns in tcp_collapse_ofo_queue()
ip: in cmsg IP(V6)_ORIGDSTADDR call pskb_may_pull
usb: cdc_acm: Add quirk for Castles VEGA3000
usb: core: handle hub C_PORT_OVER_CURRENT condition
usb: gadget: f_fs: Only return delayed status when len is 0
driver core: Partially revert "driver core: correct device's shutdown order"
can: xilinx_can: fix RX loop if RXNEMP is asserted without RXOK
can: xilinx_can: fix recovery from error states not being propagated
can: xilinx_can: fix device dropping off bus on RX overrun
can: xilinx_can: keep only 1-2 frames in TX FIFO to fix TX accounting
can: xilinx_can: fix incorrect clear of non-processed interrupts
can: xilinx_can: fix RX overflow interrupt not being enabled
turn off -Wattribute-alias
ARM: fix put_user() for gcc-8
Linux 4.4.145
Change-Id: I449c110f7f186f2c72c9cc45e00a8deda0d54e40
Signed-off-by: Greg Kroah-Hartman <gregkh@google.com>
[ Upstream commit a0496ef2c23b3b180902dd185d0d63ccbc624cf8 ]
Per DCTCP RFC8257 (Section 3.2) the ACK reflecting the CE status change
has to be sent immediately so the sender can respond quickly:
""" When receiving packets, the CE codepoint MUST be processed as follows:
1. If the CE codepoint is set and DCTCP.CE is false, set DCTCP.CE to
true and send an immediate ACK.
2. If the CE codepoint is not set and DCTCP.CE is true, set DCTCP.CE
to false and send an immediate ACK.
"""
Previously DCTCP implementation may continue to delay the ACK. This
patch fixes that to implement the RFC by forcing an immediate ACK.
Tested with this packetdrill script provided by Larry Brakmo
0.000 socket(..., SOCK_STREAM, IPPROTO_TCP) = 3
0.000 setsockopt(3, SOL_SOCKET, SO_REUSEADDR, [1], 4) = 0
0.000 setsockopt(3, SOL_TCP, TCP_CONGESTION, "dctcp", 5) = 0
0.000 bind(3, ..., ...) = 0
0.000 listen(3, 1) = 0
0.100 < [ect0] SEW 0:0(0) win 32792 <mss 1000,sackOK,nop,nop,nop,wscale 7>
0.100 > SE. 0:0(0) ack 1 <mss 1460,nop,nop,sackOK,nop,wscale 8>
0.110 < [ect0] . 1:1(0) ack 1 win 257
0.200 accept(3, ..., ...) = 4
+0 setsockopt(4, SOL_SOCKET, SO_DEBUG, [1], 4) = 0
0.200 < [ect0] . 1:1001(1000) ack 1 win 257
0.200 > [ect01] . 1:1(0) ack 1001
0.200 write(4, ..., 1) = 1
0.200 > [ect01] P. 1:2(1) ack 1001
0.200 < [ect0] . 1001:2001(1000) ack 2 win 257
+0.005 < [ce] . 2001:3001(1000) ack 2 win 257
+0.000 > [ect01] . 2:2(0) ack 2001
// Previously the ACK below would be delayed by 40ms
+0.000 > [ect01] E. 2:2(0) ack 3001
+0.500 < F. 9501:9501(0) ack 4 win 257
Signed-off-by: Yuchung Cheng <ycheng@google.com>
Acked-by: Neal Cardwell <ncardwell@google.com>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
[ Upstream commit 27cde44a259c380a3c09066fc4b42de7dde9b1ad ]
Currently when a DCTCP receiver delays an ACK and receive a
data packet with a different CE mark from the previous one's, it
sends two immediate ACKs acking previous and latest sequences
respectly (for ECN accounting).
Previously sending the first ACK may mark off the delayed ACK timer
(tcp_event_ack_sent). This may subsequently prevent sending the
second ACK to acknowledge the latest sequence (tcp_ack_snd_check).
The culprit is that tcp_send_ack() assumes it always acknowleges
the latest sequence, which is not true for the first special ACK.
The fix is to not make the assumption in tcp_send_ack and check the
actual ack sequence before cancelling the delayed ACK. Further it's
safer to pass the ack sequence number as a local variable into
tcp_send_ack routine, instead of intercepting tp->rcv_nxt to avoid
future bugs like this.
Reported-by: Neal Cardwell <ncardwell@google.com>
Signed-off-by: Yuchung Cheng <ycheng@google.com>
Acked-by: Neal Cardwell <ncardwell@google.com>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
-----BEGIN PGP SIGNATURE-----
iQIzBAABCAAdFiEEZH8oZUiU471FcZm+ONu9yGCSaT4FAlqzaAQACgkQONu9yGCS
aT6/iw//SWWvNvIYVs9SUcAk1lLRdxIPEJ2mR4CELU8S9WblInX8MS/w3+qGUGVt
9LzQUfxqbwpX0luel6QnyvME66qQI+VQ0CzA7Qtuzoy2UNlr5woW112nOzkBOuKt
+KjHg3cUwPijljEZnhPq7eEv6TrHrb/ZsU4jjTLMrlL1u/TYfJQGrU+JAFpTbHto
9IyVu/F80E/Vln2GHwYidHUbWG70lYIY75J/CXvklbfrxuRpsrOhQla52YspzI2p
B9Q5p6Ct7FIO1SEBdVdBYLMMBbrt/+dbQzHMRWyUtTudiCSgeQ+8JEWRo6Pu+dGs
1cW9uCJ+Q7AolQWicG/H80YBZczkB/p77PRCua2vVpx8e/FYSujP3E5T8Mnl7k81
mmSoNRF8SqYXSwWDCqd+c/ck4zPRKPM++eNpkor+M0QyT9MOfh9oq4vIW4HoZQEm
j+mZWUNihLO+O7znNGLoQLWn2mzDLjcrfnM6DOaKc5FJht/7OG8yyaobINYoUCmA
+1cQP0RsRrWo87zpcWoHKRSZ6Iw+YtEu3i7q2bqEdtb2Jc/88nQ5/eYcBLYDaPtw
DZ8GXL8kxXCP2J2GumwHIBTY68yZPVBknABGo5odqZwfkSgjK5fL17MzWX10Occp
0U6r+IluHkRCSxcCuyS8w10+TPn/Rw26wJCstPYargfiw/twILA=
=H2Hj
-----END PGP SIGNATURE-----
Merge 4.4.123 into android-4.4
Changes in 4.4.123
blkcg: fix double free of new_blkg in blkcg_init_queue
Input: tsc2007 - check for presence and power down tsc2007 during probe
staging: speakup: Replace BUG_ON() with WARN_ON().
staging: wilc1000: add check for kmalloc allocation failure.
HID: reject input outside logical range only if null state is set
drm: qxl: Don't alloc fbdev if emulation is not supported
ath10k: fix a warning during channel switch with multiple vaps
PCI/MSI: Stop disabling MSI/MSI-X in pci_device_shutdown()
selinux: check for address length in selinux_socket_bind()
perf sort: Fix segfault with basic block 'cycles' sort dimension
i40e: Acquire NVM lock before reads on all devices
i40e: fix ethtool to get EEPROM data from X722 interface
perf tools: Make perf_event__synthesize_mmap_events() scale
drivers: net: xgene: Fix hardware checksum setting
drm: Defer disabling the vblank IRQ until the next interrupt (for instant-off)
ath10k: disallow DFS simulation if DFS channel is not enabled
perf probe: Return errno when not hitting any event
HID: clamp input to logical range if no null state
net/8021q: create device with all possible features in wanted_features
ARM: dts: Adjust moxart IRQ controller and flags
batman-adv: handle race condition for claims between gateways
of: fix of_device_get_modalias returned length when truncating buffers
solo6x10: release vb2 buffers in solo_stop_streaming()
scsi: ipr: Fix missed EH wakeup
media: i2c/soc_camera: fix ov6650 sensor getting wrong clock
timers, sched_clock: Update timeout for clock wrap
sysrq: Reset the watchdog timers while displaying high-resolution timers
Input: qt1070 - add OF device ID table
sched: act_csum: don't mangle TCP and UDP GSO packets
ASoC: rcar: ssi: don't set SSICR.CKDV = 000 with SSIWSR.CONT
spi: omap2-mcspi: poll OMAP2_MCSPI_CHSTAT_RXS for PIO transfer
tcp: sysctl: Fix a race to avoid unexpected 0 window from space
dmaengine: imx-sdma: add 1ms delay to ensure SDMA channel is stopped
driver: (adm1275) set the m,b and R coefficients correctly for power
mm: Fix false-positive VM_BUG_ON() in page_cache_{get,add}_speculative()
blk-throttle: make sure expire time isn't too big
f2fs: relax node version check for victim data in gc
bonding: refine bond_fold_stats() wrap detection
braille-console: Fix value returned by _braille_console_setup
drm/vmwgfx: Fixes to vmwgfx_fb
vxlan: vxlan dev should inherit lowerdev's gso_max_size
NFC: nfcmrvl: Include unaligned.h instead of access_ok.h
NFC: nfcmrvl: double free on error path
ARM: dts: r8a7790: Correct parent of SSI[0-9] clocks
ARM: dts: r8a7791: Correct parent of SSI[0-9] clocks
powerpc: Avoid taking a data miss on every userspace instruction miss
net/faraday: Add missing include of of.h
ARM: dts: koelsch: Correct clock frequency of X2 DU clock input
reiserfs: Make cancel_old_flush() reliable
ALSA: firewire-digi00x: handle all MIDI messages on streaming packets
fm10k: correctly check if interface is removed
scsi: ses: don't get power status of SES device slot on probe
apparmor: Make path_max parameter readonly
iommu/iova: Fix underflow bug in __alloc_and_insert_iova_range
video: ARM CLCD: fix dma allocation size
drm/radeon: Fail fb creation from imported dma-bufs.
drm/amdgpu: Fail fb creation from imported dma-bufs. (v2)
coresight: Fixes coresight DT parse to get correct output port ID.
MIPS: BPF: Quit clobbering callee saved registers in JIT code.
MIPS: BPF: Fix multiple problems in JIT skb access helpers.
MIPS: r2-on-r6-emu: Fix BLEZL and BGTZL identification
MIPS: r2-on-r6-emu: Clear BLTZALL and BGEZALL debugfs counters
regulator: isl9305: fix array size
md/raid6: Fix anomily when recovering a single device in RAID6.
usb: dwc2: Make sure we disconnect the gadget state
usb: gadget: dummy_hcd: Fix wrong power status bit clear/reset in dummy_hub_control()
drivers/perf: arm_pmu: handle no platform_device
perf inject: Copy events when reordering events in pipe mode
perf session: Don't rely on evlist in pipe mode
scsi: sg: check for valid direction before starting the request
scsi: sg: close race condition in sg_remove_sfp_usercontext()
kprobes/x86: Fix kprobe-booster not to boost far call instructions
kprobes/x86: Set kprobes pages read-only
pwm: tegra: Increase precision in PWM rate calculation
wil6210: fix memory access violation in wil_memcpy_from/toio_32
drm/edid: set ELD connector type in drm_edid_to_eld()
video/hdmi: Allow "empty" HDMI infoframes
HID: elo: clear BTN_LEFT mapping
ARM: dts: exynos: Correct Trats2 panel reset line
sched: Stop switched_to_rt() from sending IPIs to offline CPUs
sched: Stop resched_cpu() from sending IPIs to offline CPUs
test_firmware: fix setting old custom fw path back on exit
net: xfrm: allow clearing socket xfrm policies.
mtd: nand: fix interpretation of NAND_CMD_NONE in nand_command[_lp]()
ARM: dts: am335x-pepper: Fix the audio CODEC's reset pin
ARM: dts: omap3-n900: Fix the audio CODEC's reset pin
ath10k: update tdls teardown state to target
cpufreq: Fix governor module removal race
clk: qcom: msm8916: fix mnd_width for codec_digcodec
ath10k: fix invalid STS_CAP_OFFSET_MASK
tools/usbip: fixes build with musl libc toolchain
spi: sun6i: disable/unprepare clocks on remove
scsi: core: scsi_get_device_flags_keyed(): Always return device flags
scsi: devinfo: apply to HP XP the same flags as Hitachi VSP
scsi: dh: add new rdac devices
media: cpia2: Fix a couple off by one bugs
veth: set peer GSO values
drm/amdkfd: Fix memory leaks in kfd topology
agp/intel: Flush all chipset writes after updating the GGTT
mac80211_hwsim: enforce PS_MANUAL_POLL to be set after PS_ENABLED
mac80211: remove BUG() when interface type is invalid
ASoC: nuc900: Fix a loop timeout test
ipvlan: add L2 check for packets arriving via virtual devices
rcutorture/configinit: Fix build directory error message
ima: relax requiring a file signature for new files with zero length
selftests/x86/entry_from_vm86: Exit with 1 if we fail
selftests/x86: Add tests for User-Mode Instruction Prevention
selftests/x86: Add tests for the STR and SLDT instructions
selftests/x86/entry_from_vm86: Add test cases for POPF
x86/vm86/32: Fix POPF emulation
x86/mm: Fix vmalloc_fault to use pXd_large
ALSA: pcm: Fix UAF in snd_pcm_oss_get_formats()
ALSA: hda - Revert power_save option default value
ALSA: seq: Fix possible UAF in snd_seq_check_queue()
ALSA: seq: Clear client entry before deleting else at closing
drm/amdgpu/dce: Don't turn off DP sink when disconnected
fs: Teach path_connected to handle nfs filesystems with multiple roots.
lock_parent() needs to recheck if dentry got __dentry_kill'ed under it
fs/aio: Add explicit RCU grace period when freeing kioctx
fs/aio: Use RCU accessors for kioctx_table->table[]
irqchip/gic-v3-its: Ensure nr_ites >= nr_lpis
scsi: sg: fix SG_DXFER_FROM_DEV transfers
scsi: sg: fix static checker warning in sg_is_valid_dxfer
scsi: sg: only check for dxfer_len greater than 256M
ARM: dts: LogicPD Torpedo: Fix I2C1 pinmux
btrfs: alloc_chunk: fix DUP stripe size handling
btrfs: Fix use-after-free when cleaning up fs_devs with a single stale device
USB: gadget: udc: Add missing platform_device_put() on error in bdc_pci_probe()
usb: gadget: bdc: 64-bit pointer capability check
bpf: fix incorrect sign extension in check_alu_op()
Linux 4.4.123
Change-Id: Ieb89411248f93522dde29edb8581f8ece22e33a7
Signed-off-by: Greg Kroah-Hartman <gregkh@google.com>
[ Upstream commit c48367427a39ea0b85c7cf018fe4256627abfd9e ]
Because sysctl_tcp_adv_win_scale could be changed any time, so there
is one race in tcp_win_from_space.
For example,
1.sysctl_tcp_adv_win_scale<=0 (sysctl_tcp_adv_win_scale is negative now)
2.space>>(-sysctl_tcp_adv_win_scale) (sysctl_tcp_adv_win_scale is postive now)
As a result, tcp_win_from_space returns 0. It is unexpected.
Certainly if the compiler put the sysctl_tcp_adv_win_scale into one
register firstly, then use the register directly, it would be ok.
But we could not depend on the compiler behavior.
Signed-off-by: Gao Feng <fgao@ikuai8.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Sasha Levin <alexander.levin@microsoft.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
-----BEGIN PGP SIGNATURE-----
iQIzBAABCAAdFiEEZH8oZUiU471FcZm+ONu9yGCSaT4FAloQBz0ACgkQONu9yGCS
aT5nQxAAs/xWKpYLSLvLPYnTOmSmNJ36isgFriVT+wWOLzkrWTWuoQnluDjMQjie
nH6whZMlOnG+k5GrGF3XymxZ66tDj9TlXnPAHCC8ikcqir2/dBO/gO5v2gmFgF2E
j52mt09I3acBQJEt+Rz3xJCMa5so61uDYGtqk/URcPEW1nBa1rfA1QIy/9zv2/aw
2yPSz4NQlv+7yvjguw4Ik5Yt/hGeu1Y8Kuc4bVHG2TB+y0QYwri42bBwQV7llili
XqwfjFJYGMqWJqHGF/p0hD+/Xylw6GnDzxDZQDMNhsuWcfe3tUhuOVkX30E96fh2
ipT4wI5DTmql8EN/r/P7VS2BKL4W5HEMeNEd2APkGNGnSrzKGbd0CQ+cWVZbr645
R03AbqZjhaQKwRi+n82q1mMb4p+3Z/F/T8twHmYg/DLta3kzzRdfVJPNFBFIoSnF
Bay8KJKqoMv2Bjhla78pHMoqSQ9j/fJc2iPIAABtFlsTjic+/STiS7ANAsmDdJtt
8XXc6mFQfbulKKlKKqudPLjOpUNu1SrsOcc9gmovbTy7dN6FBOfJwFMCYonNyXAc
6/ACSxYJlnZ9YEacEmcXmz0GTytyKiTYE3fNsXc/8fHnRZ1+yea9Mo77wWkj7K4V
IqNIJMCW8K+P97oL6mdUBZwMUi4zrWueakMq8SWBdKYaD5yeV2k=
=ql4j
-----END PGP SIGNATURE-----
Merge 4.4.99 into android-4.4
Changes in 4.4.99
mac80211: accept key reinstall without changing anything
mac80211: use constant time comparison with keys
mac80211: don't compare TKIP TX MIC key in reinstall prevention
usb: usbtest: fix NULL pointer dereference
Input: ims-psu - check if CDC union descriptor is sane
ALSA: seq: Cancel pending autoload work at unbinding device
tun/tap: sanitize TUNSETSNDBUF input
tcp: fix tcp_mtu_probe() vs highest_sack
l2tp: check ps->sock before running pppol2tp_session_ioctl()
tun: call dev_get_valid_name() before register_netdevice()
sctp: add the missing sock_owned_by_user check in sctp_icmp_redirect
packet: avoid panic in packet_getsockopt()
ipv6: flowlabel: do not leave opt->tot_len with garbage
net/unix: don't show information about sockets from other namespaces
ip6_gre: only increase err_count for some certain type icmpv6 in ip6gre_err
tun: allow positive return values on dev_get_valid_name() call
sctp: reset owner sk for data chunks on out queues when migrating a sock
ppp: fix race in ppp device destruction
ipip: only increase err_count for some certain type icmp in ipip_err
tcp/dccp: fix ireq->opt races
tcp/dccp: fix lockdep splat in inet_csk_route_req()
tcp/dccp: fix other lockdep splats accessing ireq_opt
security/keys: add CONFIG_KEYS_COMPAT to Kconfig
tipc: fix link attribute propagation bug
brcmfmac: remove setting IBSS mode when stopping AP
target/iscsi: Fix iSCSI task reassignment handling
target: Fix node_acl demo-mode + uncached dynamic shutdown regression
misc: panel: properly restore atomic counter on error path
Linux 4.4.99
Signed-off-by: Greg Kroah-Hartman <gregkh@google.com>
[ Upstream commit 2b7cda9c35d3b940eb9ce74b30bbd5eb30db493d ]
Based on SNMP values provided by Roman, Yuchung made the observation
that some crashes in tcp_sacktag_walk() might be caused by MTU probing.
Looking at tcp_mtu_probe(), I found that when a new skb was placed
in front of the write queue, we were not updating tcp highest sack.
If one skb is freed because all its content was copied to the new skb
(for MTU probing), then tp->highest_sack could point to a now freed skb.
Bad things would then happen, including infinite loops.
This patch renames tcp_highest_sack_combine() and uses it
from tcp_mtu_probe() to fix the bug.
Note that I also removed one test against tp->sacked_out,
since we want to replace tp->highest_sack regardless of whatever
condition, since keeping a stale pointer to freed skb is a recipe
for disaster.
Fixes: a47e5a988a ("[TCP]: Convert highest_sack to sk_buff to allow direct access")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Reported-by: Alexei Starovoitov <alexei.starovoitov@gmail.com>
Reported-by: Roman Gushchin <guro@fb.com>
Reported-by: Oleksandr Natalenko <oleksandr@natalenko.name>
Acked-by: Alexei Starovoitov <ast@kernel.org>
Acked-by: Neal Cardwell <ncardwell@google.com>
Acked-by: Yuchung Cheng <ycheng@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
[ Upstream commit ac6e780070e30e4c35bd395acfe9191e6268bdd3 ]
With syzkaller help, Marco Grassi found a bug in TCP stack,
crashing in tcp_collapse()
Root cause is that sk_filter() can truncate the incoming skb,
but TCP stack was not really expecting this to happen.
It probably was expecting a simple DROP or ACCEPT behavior.
We first need to make sure no part of TCP header could be removed.
Then we need to adjust TCP_SKB_CB(skb)->end_seq
Many thanks to syzkaller team and Marco for giving us a reproducer.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Reported-by: Marco Grassi <marco.gra@gmail.com>
Reported-by: Vladis Dronov <vdronov@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
[ Upstream commit bb1fceca22492109be12640d49f5ea5a544c6bb4 ]
When tcp_sendmsg() allocates a fresh and empty skb, it puts it at the
tail of the write queue using tcp_add_write_queue_tail()
Then it attempts to copy user data into this fresh skb.
If the copy fails, we undo the work and remove the fresh skb.
Unfortunately, this undo lacks the change done to tp->highest_sack and
we can leave a dangling pointer (to a freed skb)
Later, tcp_xmit_retransmit_queue() can dereference this pointer and
access freed memory. For regular kernels where memory is not unmapped,
this might cause SACK bugs because tcp_highest_sack_seq() is buggy,
returning garbage instead of tp->snd_nxt, but with various debug
features like CONFIG_DEBUG_PAGEALLOC, this can crash the kernel.
This bug was found by Marco Grassi thanks to syzkaller.
Fixes: 6859d49475 ("[TCP]: Abstract tp->highest_sack accessing & point to next skb")
Reported-by: Marco Grassi <marco.gra@gmail.com>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Ilpo Järvinen <ilpo.jarvinen@helsinki.fi>
Cc: Yuchung Cheng <ycheng@google.com>
Cc: Neal Cardwell <ncardwell@google.com>
Acked-by: Neal Cardwell <ncardwell@google.com>
Reviewed-by: Cong Wang <xiyou.wangcong@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Tested-by: Holger Hoffstätte <holger@applied-asynchrony.com>
(cherry picked from commit bb1fceca22492109be12640d49f5ea5a544c6bb4)
When tcp_sendmsg() allocates a fresh and empty skb, it puts it at the
tail of the write queue using tcp_add_write_queue_tail()
Then it attempts to copy user data into this fresh skb.
If the copy fails, we undo the work and remove the fresh skb.
Unfortunately, this undo lacks the change done to tp->highest_sack and
we can leave a dangling pointer (to a freed skb)
Later, tcp_xmit_retransmit_queue() can dereference this pointer and
access freed memory. For regular kernels where memory is not unmapped,
this might cause SACK bugs because tcp_highest_sack_seq() is buggy,
returning garbage instead of tp->snd_nxt, but with various debug
features like CONFIG_DEBUG_PAGEALLOC, this can crash the kernel.
This bug was found by Marco Grassi thanks to syzkaller.
Fixes: 6859d49475 ("[TCP]: Abstract tp->highest_sack accessing & point to next skb")
Reported-by: Marco Grassi <marco.gra@gmail.com>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Ilpo Järvinen <ilpo.jarvinen@helsinki.fi>
Cc: Yuchung Cheng <ycheng@google.com>
Cc: Neal Cardwell <ncardwell@google.com>
Acked-by: Neal Cardwell <ncardwell@google.com>
Reviewed-by: Cong Wang <xiyou.wangcong@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Change-Id: I58bb02d6e4e399612e8580b9e02d11e661df82f5
Bug: 31183296
[ Upstream commit 9cf7490360bf2c46a16b7525f899e4970c5fc144 ]
Petr Novopashenniy reported that ICMP redirects on SYN_RECV sockets
were leading to RST.
This is of course incorrect.
A specific list of ICMP messages should be able to drop a SYN_RECV.
For instance, a REDIRECT on SYN_RECV shall be ignored, as we do
not hold a dst per SYN_RECV pseudo request.
Bugzilla: https://bugzilla.kernel.org/show_bug.cgi?id=111751
Fixes: 079096f103 ("tcp/dccp: install syn_recv requests into ehash table")
Reported-by: Petr Novopashenniy <pety@rusnet.ru>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
This implements SOCK_DESTROY for TCP sockets. It causes all
blocking calls on the socket to fail fast with ECONNABORTED and
causes a protocol close of the socket. It informs the other end
of the connection by sending a RST, i.e., initiating a TCP ABORT
as per RFC 793. ECONNABORTED was chosen for consistency with
FreeBSD.
[cherry-pick of net-next c1e64e298b8cad309091b95d8436a0255c84f54a]
Change-Id: I728a01ef03f2ccfb9016a3f3051ef00975980e49
Signed-off-by: Lorenzo Colitti <lorenzo@google.com>
Acked-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
The default initial rwnd is hardcoded to 10.
Now we allow it to be controlled via
/proc/sys/net/ipv4/tcp_default_init_rwnd
which limits the values from 3 to 100
This is somewhat needed because ipv6 routes are
autoconfigured by the kernel.
See "An Argument for Increasing TCP's Initial Congestion Window"
in https://developers.google.com/speed/articles/tcp_initcwnd_paper.pdf
Change-Id: I386b2a9d62de0ebe05c1ebe1b4bd91b314af5c54
Signed-off-by: JP Abgrall <jpa@google.com>
Conflicts:
net/ipv4/sysctl_net_ipv4.c
net/ipv4/tcp_input.c
Introduce a new socket ioctl, SIOCKILLADDR, that nukes all sockets
bound to the same local address. This is useful in situations with
dynamic IPs, to kill stuck connections.
Signed-off-by: Brian Swetland <swetland@google.com>
net: fix tcp_v4_nuke_addr
Signed-off-by: Dima Zavin <dima@android.com>
net: ipv4: Fix a spinlock recursion bug in tcp_v4_nuke.
We can't hold the lock while calling to tcp_done(), so we drop
it before calling. We then have to start at the top of the chain again.
Signed-off-by: Dima Zavin <dima@android.com>
net: ipv4: Fix race in tcp_v4_nuke_addr().
To fix a recursive deadlock in 2.6.29, we stopped holding the hash table lock
across tcp_done() calls. This fixed the deadlock, but introduced a race where
the socket could die or change state.
Fix: Before unlocking the hash table, we grab a reference to the socket. We
can then unlock the hash table without risk of the socket going away. We then
lock the socket, which is safe because it is pinned. We can then call
tcp_done() without recursive deadlock and without race. Upon return, we unlock
the socket and then unpin it, killing it.
Change-Id: Idcdae072b48238b01bdbc8823b60310f1976e045
Signed-off-by: Robert Love <rlove@google.com>
Acked-by: Dima Zavin <dima@android.com>
ipv4: disable bottom halves around call to tcp_done().
Signed-off-by: Robert Love <rlove@google.com>
Signed-off-by: Colin Cross <ccross@android.com>
ipv4: Move sk_error_report inside bh_lock_sock in tcp_v4_nuke_addr
When sk_error_report is called, it wakes up the user-space thread, which then
calls tcp_close. When the tcp_close is interrupted by the tcp_v4_nuke_addr
ioctl thread running tcp_done, it leaks 392 bytes and triggers a WARN_ON.
This patch moves the call to sk_error_report inside the bh_lock_sock, which
matches the locking used in tcp_v4_err.
Signed-off-by: Colin Cross <ccross@android.com>
Multiple cpus can process duplicates of incoming ACK messages
matching a SYN_RECV request socket. This is a rare event under
normal operations, but definitely can happen.
Only one must win the race, otherwise corruption would occur.
To fix this without adding new atomic ops, we use logic in
inet_ehash_nolisten() to detect the request was present in the same
ehash bucket where we try to insert the new child.
If request socket was not found, we have to undo the child creation.
This actually removes a spin_lock()/spin_unlock() pair in
reqsk_queue_unlink() for the fast path.
Fixes: e994b2f0fb ("tcp: do not lock listener to process SYN packets")
Fixes: 079096f103 ("tcp/dccp: install syn_recv requests into ehash table")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
This patch implements the second half of RACK that uses the the most
recent transmit time among all delivered packets to detect losses.
tcp_rack_mark_lost() is called upon receiving a dubious ACK.
It then checks if an not-yet-sacked packet was sent at least
"reo_wnd" prior to the sent time of the most recently delivered.
If so the packet is deemed lost.
The "reo_wnd" reordering window starts with 1msec for fast loss
detection and changes to min-RTT/4 when reordering is observed.
We found 1msec accommodates well on tiny degree of reordering
(<3 pkts) on faster links. We use min-RTT instead of SRTT because
reordering is more of a path property but SRTT can be inflated by
self-inflicated congestion. The factor of 4 is borrowed from the
delayed early retransmit and seems to work reasonably well.
Since RACK is still experimental, it is now used as a supplemental
loss detection on top of existing algorithms. It is only effective
after the fast recovery starts or after the timeout occurs. The
fast recovery is still triggered by FACK and/or dupack threshold
instead of RACK.
We introduce a new sysctl net.ipv4.tcp_recovery for future
experiments of loss recoveries. For now RACK can be disabled by
setting it to 0.
Signed-off-by: Yuchung Cheng <ycheng@google.com>
Signed-off-by: Neal Cardwell <ncardwell@google.com>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
This patch is the first half of the RACK loss recovery.
RACK loss recovery uses the notion of time instead
of packet sequence (FACK) or counts (dupthresh). It's inspired by the
previous FACK heuristic in tcp_mark_lost_retrans(): when a limited
transmit (new data packet) is sacked, then current retransmitted
sequence below the newly sacked sequence must been lost,
since at least one round trip time has elapsed.
But it has several limitations:
1) can't detect tail drops since it depends on limited transmit
2) is disabled upon reordering (assumes no reordering)
3) only enabled in fast recovery ut not timeout recovery
RACK (Recently ACK) addresses these limitations with the notion
of time instead: a packet P1 is lost if a later packet P2 is s/acked,
as at least one round trip has passed.
Since RACK cares about the time sequence instead of the data sequence
of packets, it can detect tail drops when later retransmission is
s/acked while FACK or dupthresh can't. For reordering RACK uses a
dynamically adjusted reordering window ("reo_wnd") to reduce false
positives on ever (small) degree of reordering.
This patch implements tcp_advanced_rack() which tracks the
most recent transmission time among the packets that have been
delivered (ACKed or SACKed) in tp->rack.mstamp. This timestamp
is the key to determine which packet has been lost.
Consider an example that the sender sends six packets:
T1: P1 (lost)
T2: P2
T3: P3
T4: P4
T100: sack of P2. rack.mstamp = T2
T101: retransmit P1
T102: sack of P2,P3,P4. rack.mstamp = T4
T205: ACK of P4 since the hole is repaired. rack.mstamp = T101
We need to be careful about spurious retransmission because it may
falsely advance tp->rack.mstamp by an RTT or an RTO, causing RACK
to falsely mark all packets lost, just like a spurious timeout.
We identify spurious retransmission by the ACK's TS echo value.
If TS option is not applicable but the retransmission is acknowledged
less than min-RTT ago, it is likely to be spurious. We refrain from
using the transmission time of these spurious retransmissions.
The second half is implemented in the next patch that marks packet
lost using RACK timestamp.
Signed-off-by: Yuchung Cheng <ycheng@google.com>
Signed-off-by: Neal Cardwell <ncardwell@google.com>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Kathleen Nichols' algorithm for tracking the minimum RTT of a
data stream over some measurement window. It uses constant space
and constant time per update. Yet it almost always delivers
the same minimum as an implementation that has to keep all
the data in the window. The measurement window is tunable via
sysctl.net.ipv4.tcp_min_rtt_wlen with a default value of 5 minutes.
The algorithm keeps track of the best, 2nd best & 3rd best min
values, maintaining an invariant that the measurement time of
the n'th best >= n-1'th best. It also makes sure that the three
values are widely separated in the time window since that bounds
the worse case error when that data is monotonically increasing
over the window.
Upon getting a new min, we can forget everything earlier because
it has no value - the new min is less than everything else in the
window by definition and it's the most recent. So we restart fresh
on every new min and overwrites the 2nd & 3rd choices. The same
property holds for the 2nd & 3rd best.
Therefore we have to maintain two invariants to maximize the
information in the samples, one on values (1st.v <= 2nd.v <=
3rd.v) and the other on times (now-win <=1st.t <= 2nd.t <= 3rd.t <=
now). These invariants determine the structure of the code
The RTT input to the windowed filter is the minimum RTT measured
from ACK or SACK, or as the last resort from TCP timestamps.
The accessor tcp_min_rtt() returns the minimum RTT seen in the
window. ~0U indicates it is not available. The minimum is 1usec
even if the true RTT is below that.
Signed-off-by: Yuchung Cheng <ycheng@google.com>
Signed-off-by: Neal Cardwell <ncardwell@google.com>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
At the time of commit fff3269907 ("tcp: reflect SYN queue_mapping into
SYNACK packets") we had little ways to cope with SYN floods.
We no longer need to reflect incoming skb queue mappings, and instead
can pick a TX queue based on cpu cooking the SYNACK, with normal XPS
affinities.
Note that all SYNACK retransmits were picking TX queue 0, this no longer
is a win given that SYNACK rtx are now distributed on all cpus.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
If a listen backlog is very big (to avoid syncookies), then
the listener sk->sk_wmem_alloc is the main source of false
sharing, as we need to touch it twice per SYNACK re-transmit
and TX completion.
(One SYN packet takes listener lock once, but up to 6 SYNACK
are generated)
By attaching the skb to the request socket, we remove this
source of contention.
Tested:
listen(fd, 10485760); // single listener (no SO_REUSEPORT)
16 RX/TX queue NIC
Sustain a SYNFLOOD attack of ~320,000 SYN per second,
Sending ~1,400,000 SYNACK per second.
Perf profiles now show listener spinlock being next bottleneck.
20.29% [kernel] [k] queued_spin_lock_slowpath
10.06% [kernel] [k] __inet_lookup_established
5.12% [kernel] [k] reqsk_timer_handler
3.22% [kernel] [k] get_next_timer_interrupt
3.00% [kernel] [k] tcp_make_synack
2.77% [kernel] [k] ipt_do_table
2.70% [kernel] [k] run_timer_softirq
2.50% [kernel] [k] ip_finish_output
2.04% [kernel] [k] cascade
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
In this patch, we insert request sockets into TCP/DCCP
regular ehash table (where ESTABLISHED and TIMEWAIT sockets
are) instead of using the per listener hash table.
ACK packets find SYN_RECV pseudo sockets without having
to find and lock the listener.
In nominal conditions, this halves pressure on listener lock.
Note that this will allow for SO_REUSEPORT refinements,
so that we can select a listener using cpu/numa affinities instead
of the prior 'consistent hash', since only SYN packets will
apply this selection logic.
We will shrink listen_sock in the following patch to ease
code review.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Ying Cai <ycai@google.com>
Cc: Willem de Bruijn <willemb@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
When request sockets are no longer in a per listener hash table
but on regular TCP ehash, we need to access listener uid
through req->rsk_listener
get_openreq6() also gets a const for its request socket argument.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
These functions do not change the listener socket.
Goal is to make sure tcp_conn_request() is not messing with
listener in a racy way.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Some common IPv4/IPv6 code can be factorized.
Also constify cookie_init_sequence() socket argument.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
We'll soon no longer hold listener socket lock, these
functions do not modify the socket in any way.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Factorize code to get tcp header from skb. It makes no sense
to duplicate code in callers.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Once we realize tcp_rcv_synsent_state_process() does not use
its 'len' argument and we get rid of it, then it becomes clear
this argument is no longer used in tcp_rcv_state_process()
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
We found that a TCP Fast Open passive connection was vulnerable
to reorders, as the exchange might look like
[1] C -> S S <FO ...> <request>
[2] S -> C S. ack request <options>
[3] S -> C . <answer>
packets [2] and [3] can be generated at almost the same time.
If C receives the 3rd packet before the 2nd, it will drop it as
the socket is in SYN_SENT state and expects a SYNACK.
S will have to retransmit the answer.
Current OOO avoidance in linux is defeated because SYNACK
packets are attached to the LISTEN socket, while DATA packets
are attached to the children. They might be sent by different cpus,
and different TX queues might be selected.
It turns out that for TFO, we created a child, which is a
full blown socket in TCP_SYN_RECV state, and we simply can attach
the SYNACK packet to this socket.
This means that at the time tcp_sendmsg() pushes DATA packet,
skb->ooo_okay will be set iff the SYNACK packet had been sent
and TX completed.
This removes the reorder source at the host level.
We also removed the export of tcp_try_fastopen(), as it is no
longer called from IPv6.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: Yuchung Cheng <ycheng@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
This is done to make sure we do not change listener socket
while sending SYNACK packets while socket lock is not held.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
This documents fact that listener lock might not be held
at the time SYNACK are sent.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
listener socket is not locked when tcp_make_synack() is called.
We better make sure no field is written.
There is one exception : Since SYNACK packets are attached to the listener
at this moment (or SYN_RECV child in case of Fast Open),
sock_wmalloc() needs to update sk->sk_wmem_alloc, but this is done using
atomic operations so this is safe.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
When TCP new listener is done, these functions will be called
without socket lock being held. Make sure they don't change
anything.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Soon, listener socket wont be locked when tcp_openreq_init_rwin()
is called. We need to read socket fields once, as their value
could change under us.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Soon, listener socket spinlock will no longer be held,
add const arguments to tcp_v[46]_init_req() to make clear these
functions can not mess socket fields.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Currently SYN/ACK RTT is measured in jiffies. For LAN the SYN/ACK
RTT is often measured as 0ms or sometimes 1ms, which would affect
RTT estimation and min RTT samping used by some congestion control.
This patch improves SYN/ACK RTT to be usec resolution if platform
supports it. While the timestamping of SYN/ACK is done in request
sock, the RTT measurement is carefully arranged to avoid storing
another u64 timestamp in tcp_sock.
For regular handshake w/o SYNACK retransmission, the RTT is sampled
right after the child socket is created and right before the request
sock is released (tcp_check_req() in tcp_minisocks.c)
For Fast Open the child socket is already created when SYN/ACK was
sent, the RTT is sampled in tcp_rcv_state_process() after processing
the final ACK an right before the request socket is released.
If the SYN/ACK was retransmistted or SYN-cookie was used, we rely
on TCP timestamps to measure the RTT. The sample is taken at the
same place in tcp_rcv_state_process() after the timestamp values
are validated in tcp_validate_incoming(). Note that we do not store
TS echo value in request_sock for SYN-cookies, because the value
is already stored in tp->rx_opt used by tcp_ack_update_rtt().
One side benefit is that the RTT measurement now happens before
initializing congestion control (of the passive side). Therefore
the congestion control can use the SYN/ACK RTT.
Signed-off-by: Yuchung Cheng <ycheng@google.com>
Signed-off-by: Neal Cardwell <ncardwell@google.com>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Currently, the following case doesn't use DCTCP, even if it should:
A responder has f.e. Cubic as system wide default, but for a specific
route to the initiating host, DCTCP is being set in RTAX_CC_ALGO. The
initiating host then uses DCTCP as congestion control, but since the
initiator sets ECT(0), tcp_ecn_create_request() doesn't set ecn_ok,
and we have to fall back to Reno after 3WHS completes.
We were thinking on how to solve this in a minimal, non-intrusive
way without bloating tcp_ecn_create_request() needlessly: lets cache
the CA ecn option flag in RTAX_FEATURES. In other words, when ECT(0)
is set on the SYN packet, set ecn_ok=1 iff route RTAX_FEATURES
contains the unexposed (internal-only) DST_FEATURE_ECN_CA. This allows
to only do a single metric feature lookup inside tcp_ecn_create_request().
Joint work with Florian Westphal.
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: David S. Miller <davem@davemloft.net>
When TCP pacing was added back in linux-3.12, we chose
to apply a fixed ratio of 200 % against current rate,
to allow probing for optimal throughput even during
slow start phase, where cwnd can be doubled every other gRTT.
At Google, we found it was better applying a different ratio
while in Congestion Avoidance phase.
This ratio was set to 120 %.
We've used the normal tcp_in_slow_start() helper for a while,
then tuned the condition to select the conservative ratio
as soon as cwnd >= ssthresh/2 :
- After cwnd reduction, it is safer to ramp up more slowly,
as we approach optimal cwnd.
- Initial ramp up (ssthresh == INFINITY) still allows doubling
cwnd every other RTT.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Neal Cardwell <ncardwell@google.com>
Cc: Yuchung Cheng <ycheng@google.com>
Acked-by: Neal Cardwell <ncardwell@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
slow start after idle might reduce cwnd, but we perform this
after first packet was cooked and sent.
With TSO/GSO, it means that we might send a full TSO packet
even if cwnd should have been reduced to IW10.
Moving the SSAI check in skb_entail() makes sense, because
we slightly reduce number of times this check is done,
especially for large send() and TCP Small queue callbacks from
softirq context.
As Neal pointed out, we also need to perform the check
if/when receive window opens.
Tested:
Following packetdrill test demonstrates the problem
// Test of slow start after idle
`sysctl -q net.ipv4.tcp_slow_start_after_idle=1`
0.000 socket(..., SOCK_STREAM, IPPROTO_TCP) = 3
+0 setsockopt(3, SOL_SOCKET, SO_REUSEADDR, [1], 4) = 0
+0 bind(3, ..., ...) = 0
+0 listen(3, 1) = 0
+0 < S 0:0(0) win 65535 <mss 1000,sackOK,nop,nop,nop,wscale 7>
+0 > S. 0:0(0) ack 1 <mss 1460,nop,nop,sackOK,nop,wscale 6>
+.100 < . 1:1(0) ack 1 win 511
+0 accept(3, ..., ...) = 4
+0 setsockopt(4, SOL_SOCKET, SO_SNDBUF, [200000], 4) = 0
+0 write(4, ..., 26000) = 26000
+0 > . 1:5001(5000) ack 1
+0 > . 5001:10001(5000) ack 1
+0 %{ assert tcpi_snd_cwnd == 10 }%
+.100 < . 1:1(0) ack 10001 win 511
+0 %{ assert tcpi_snd_cwnd == 20, tcpi_snd_cwnd }%
+0 > . 10001:20001(10000) ack 1
+0 > P. 20001:26001(6000) ack 1
+.100 < . 1:1(0) ack 26001 win 511
+0 %{ assert tcpi_snd_cwnd == 36, tcpi_snd_cwnd }%
+4 write(4, ..., 20000) = 20000
// If slow start after idle works properly, we should send 5 MSS here (cwnd/2)
+0 > . 26001:31001(5000) ack 1
+0 %{ assert tcpi_snd_cwnd == 10, tcpi_snd_cwnd }%
+0 > . 31001:36001(5000) ack 1
Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Neal Cardwell <ncardwell@google.com>
Cc: Yuchung Cheng <ycheng@google.com>
Acked-by: Neal Cardwell <ncardwell@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
In the original design slow start is only used to raise cwnd
when cwnd is stricly below ssthresh. It makes little sense
to slow start when cwnd == ssthresh: especially
when hystart has set ssthresh in the initial ramp, or after
recovery when cwnd resets to ssthresh. Not doing so will
also help reduce the buffer bloat slightly.
Signed-off-by: Yuchung Cheng <ycheng@google.com>
Signed-off-by: Neal Cardwell <ncardwell@google.com>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: Nandita Dukkipati <nanditad@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Add a helper to test the slow start condition in various congestion
control modules and other places. This is to prepare a slight improvement
in policy as to exactly when to slow start.
Signed-off-by: Yuchung Cheng <ycheng@google.com>
Signed-off-by: Neal Cardwell <ncardwell@google.com>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: Nandita Dukkipati <nanditad@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
In commit cd7d8498c9 ("tcp: change tcp_skb_pcount() location") we stored
gso_segs in a temporary cache hot location.
This patch does the same for gso_size.
This allows to save 2 cache line misses in tcp xmit path for
the last packet that is considered but not sent because of
various conditions (cwnd, tso defer, receiver window, TSQ...)
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>