Commit graph

499 commits

Author SHA1 Message Date
Greg Kroah-Hartman
e15716b49f This is the 4.4.152 stable release
-----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCAAdFiEEZH8oZUiU471FcZm+ONu9yGCSaT4FAlt/64YACgkQONu9yGCS
 aT4/pBAAynguZbVbn8QtYi37Kam0R4ZvXe7rKN8H1A7rwo9l9NJDaC4z2M0Iutfj
 1CfIEOeaf7WtxL25xgvDHQEOfB3/DH0xHbP/DtwqzpT0PmOUqMPaboGqOqXb//1J
 WetcluEOQfoYu1DWofPP1YkAR3vU4Bp40ucAGIN2wE4bvMTR2EMbV8Y5QYgIk6mN
 6n+Smg2Xpkq6paOhIrEt3C1P3lXlpY5Hxd54TGTRQ5c+vccXNldczIcP2Z0wue9/
 LRI8veyY1q/IDhVR8wCrlNb3df6kUQ0xixfTNnTkUJjLs3j+NAsaJiO8/nrdCUhJ
 xQORM7gQIMlccSNanKH0MHoCxhT3iMb8S6Hixvai5O+5XjP03TA7aAZ9Cyp7UqHg
 JY5SPbh7YOmvRXbx7/NAgyLYwRcJRt2PamNRApLQKFbot4bSvNJquhrAib5t6kCF
 HfbXjr9N969gLR4WmGkyOi0IHt8kaVwQitfBLZdj2QdlvyYWXmj0MuJ/I4BuZqtj
 0MyzS/v8cxkN/NWO1p1cB7pRzFtaXtHtC6rxzYXKCUycnHW9cJDf5PBgCfDMqyTY
 SdyuCeMrUo4mNEDItrKF8nbswew1T4UsayvJ6UgKHKr3QaH3Xp1mzeyt1GU38tn1
 ogKm9cVbOuAhnic67ikISFsj8oNptrq0w+Zqe3AKGO8B7CwXwis=
 =Q/T6
 -----END PGP SIGNATURE-----

Merge 4.4.152 into android-4.4

Changes in 4.4.152
	ARC: Explicitly add -mmedium-calls to CFLAGS
	netfilter: ipv6: nf_defrag: reduce struct net memory waste
	selftests: pstore: return Kselftest Skip code for skipped tests
	selftests: static_keys: return Kselftest Skip code for skipped tests
	selftests: user: return Kselftest Skip code for skipped tests
	selftests: zram: return Kselftest Skip code for skipped tests
	selftests: sync: add config fragment for testing sync framework
	ARM: dts: Cygnus: Fix I2C controller interrupt type
	usb: dwc2: fix isoc split in transfer with no data
	usb: gadget: composite: fix delayed_status race condition when set_interface
	usb: gadget: dwc2: fix memory leak in gadget_init()
	scsi: xen-scsifront: add error handling for xenbus_printf
	arm64: make secondary_start_kernel() notrace
	qed: Add sanity check for SIMD fastpath handler.
	enic: initialize enic->rfs_h.lock in enic_probe
	net: hamradio: use eth_broadcast_addr
	net: propagate dev_get_valid_name return code
	ARC: Enable machine_desc->init_per_cpu for !CONFIG_SMP
	net: davinci_emac: match the mdio device against its compatible if possible
	locking/lockdep: Do not record IRQ state within lockdep code
	ipv6: mcast: fix unsolicited report interval after receiving querys
	Smack: Mark inode instant in smack_task_to_inode
	cxgb4: when disabling dcb set txq dcb priority to 0
	brcmfmac: stop watchdog before detach and free everything
	ARM: dts: am437x: make edt-ft5x06 a wakeup source
	usb: xhci: increase CRS timeout value
	perf test session topology: Fix test on s390
	perf report powerpc: Fix crash if callchain is empty
	selftests/x86/sigreturn/64: Fix spurious failures on AMD CPUs
	ARM: dts: da850: Fix interrups property for gpio
	dmaengine: k3dma: Off by one in k3_of_dma_simple_xlate()
	md/raid10: fix that replacement cannot complete recovery after reassemble
	drm/exynos: gsc: Fix support for NV16/61, YUV420/YVU420 and YUV422 modes
	drm/exynos: decon5433: Fix per-plane global alpha for XRGB modes
	drm/exynos: decon5433: Fix WINCONx reset value
	bnx2x: Fix receiving tx-timeout in error or recovery state.
	m68k: fix "bad page state" oops on ColdFire boot
	HID: wacom: Correct touch maximum XY of 2nd-gen Intuos
	ARM: imx_v6_v7_defconfig: Select ULPI support
	ARM: imx_v4_v5_defconfig: Select ULPI support
	tracing: Use __printf markup to silence compiler
	kasan: fix shadow_size calculation error in kasan_module_alloc
	smsc75xx: Add workaround for gigabit link up hardware errata.
	netfilter: x_tables: set module owner for icmp(6) matches
	ARM: pxa: irq: fix handling of ICMR registers in suspend/resume
	ieee802154: at86rf230: switch from BUG_ON() to WARN_ON() on problem
	ieee802154: at86rf230: use __func__ macro for debug messages
	ieee802154: fakelb: switch from BUG_ON() to WARN_ON() on problem
	drm/armada: fix colorkey mode property
	bnxt_en: Fix for system hang if request_irq fails
	perf llvm-utils: Remove bashism from kernel include fetch script
	ARM: 8780/1: ftrace: Only set kernel memory back to read-only after boot
	ARM: dts: am3517.dtsi: Disable reference to OMAP3 OTG controller
	ixgbe: Be more careful when modifying MAC filters
	packet: reset network header if packet shorter than ll reserved space
	qlogic: check kstrtoul() for errors
	tcp: remove DELAYED ACK events in DCTCP
	drm/nouveau/gem: off by one bugs in nouveau_gem_pushbuf_reloc_apply()
	net/ethernet/freescale/fman: fix cross-build error
	net: usb: rtl8150: demote allmulti message to dev_dbg()
	net: qca_spi: Avoid packet drop during initial sync
	net: qca_spi: Make sure the QCA7000 reset is triggered
	net: qca_spi: Fix log level if probe fails
	tcp: identify cryptic messages as TCP seq # bugs
	staging: android: ion: check for kref overflow
	KVM: irqfd: fix race between EPOLLHUP and irq_bypass_register_consumer
	ext4: fix spectre gadget in ext4_mb_regular_allocator()
	parisc: Remove ordered stores from syscall.S
	xfrm_user: prevent leaking 2 bytes of kernel memory
	netfilter: conntrack: dccp: treat SYNC/SYNCACK as invalid if no prior state
	packet: refine ring v3 block size test to hold one frame
	bridge: Propagate vlan add failure to user
	parisc: Remove unnecessary barriers from spinlock.h
	PCI: hotplug: Don't leak pci_slot on registration failure
	PCI: Skip MPS logic for Virtual Functions (VFs)
	PCI: pciehp: Fix use-after-free on unplug
	i2c: imx: Fix race condition in dma read
	reiserfs: fix broken xattr handling (heap corruption, bad retval)
	Linux 4.4.152

Change-Id: I1058813031709d20abd0bc45e9ac5fc68ab3a1d7
Signed-off-by: Greg Kroah-Hartman <gregkh@google.com>
2018-08-24 13:37:12 +02:00
Yuchung Cheng
43707aa8c5 tcp: remove DELAYED ACK events in DCTCP
[ Upstream commit a69258f7aa2623e0930212f09c586fd06674ad79 ]

After fixing the way DCTCP tracking delayed ACKs, the delayed-ACK
related callbacks are no longer needed

Signed-off-by: Yuchung Cheng <ycheng@google.com>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Acked-by: Neal Cardwell <ncardwell@google.com>
Acked-by: Lawrence Brakmo <brakmo@fb.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Sasha Levin <alexander.levin@microsoft.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2018-08-24 13:26:59 +02:00
Greg Kroah-Hartman
1396226023 This is the 4.4.146 stable release
-----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCAAdFiEEZH8oZUiU471FcZm+ONu9yGCSaT4FAltoWioACgkQONu9yGCS
 aT6YrQ//d8dWKaNZK08Z/l2ZqRS56wlNTJyHIB81p1uM2PuPHfLjsZzLQ+HnZ3Ha
 G+fedEj3sbwJp8i61TRu9Q1p/PyLWsnaryWZaK3gm4Yo8GrdVbXAY47EHwz3fbUK
 yxrC0+zQmIlyZgwzbUNGspDuAdNt2MFDug97RFF8BdhJd84Rv0BbicGMwKJQFfFN
 g0Tv6yB+8cjmnCMjmLreLyi+puWvXZtZXAi+idl9eTC4ysGDKNvO1ERptv2NC5C6
 171cbsS/ngpY5ZIUcmLy0QPPFh/ZCeoft22R3gOxZDkjT4Ro6lY5ubPKDEcn57Hv
 FSV5fuQ3cBtmsODn7LMIWqLDKuCRM/gTmvXrWxM91JDLSsuAdZWATj8k4iIoocmk
 l/3iOixBMFCGToQ1I2/O33QZOssKoDIz4bpG6+HM/Cj4anSnVZKjouJSTlNZr/3i
 ZJOXpu/MpQItc/RGo/PumzJLkXhS+HyGwPbTIOPy29NMqp+xvjZv4DttuJbqyHJ2
 Pm/OZcvU7z1wSMhcTknvZLLMQVRIICQjfPJNDefqAdrCdd233cRo37cU8egg4A0l
 F3q+ZI/ny01YWQP8KrCJyWB5lHHbEc44wUHCxet0TPZ1qaqvVcXzaWhwxP2H0L3I
 7r2u9bDg15ielw3jhPpRWZMvANbQlToNoj6YROqj5ArcIowcBPc=
 =7/iL
 -----END PGP SIGNATURE-----

Merge 4.4.146 into android-4.4

Changes in 4.4.146
	MIPS: Fix off-by-one in pci_resource_to_user()
	Input: elan_i2c - add ACPI ID for lenovo ideapad 330
	Input: i8042 - add Lenovo LaVie Z to the i8042 reset list
	Input: elan_i2c - add another ACPI ID for Lenovo Ideapad 330-15AST
	tracing: Fix double free of event_trigger_data
	tracing: Fix possible double free in event_enable_trigger_func()
	tracing/kprobes: Fix trace_probe flags on enable_trace_kprobe() failure
	tracing: Quiet gcc warning about maybe unused link variable
	xen/netfront: raise max number of slots in xennet_get_responses()
	ALSA: emu10k1: add error handling for snd_ctl_add
	ALSA: fm801: add error handling for snd_ctl_add
	nfsd: fix potential use-after-free in nfsd4_decode_getdeviceinfo
	mm: vmalloc: avoid racy handling of debugobjects in vunmap
	mm/slub.c: add __printf verification to slab_err()
	rtc: ensure rtc_set_alarm fails when alarms are not supported
	netfilter: ipset: List timing out entries with "timeout 1" instead of zero
	infiniband: fix a possible use-after-free bug
	hvc_opal: don't set tb_ticks_per_usec in udbg_init_opal_common()
	powerpc/64s: Fix compiler store ordering to SLB shadow area
	RDMA/mad: Convert BUG_ONs to error flows
	disable loading f2fs module on PAGE_SIZE > 4KB
	f2fs: fix to don't trigger writeback during recovery
	usbip: usbip_detach: Fix memory, udev context and udev leak
	perf/x86/intel/uncore: Correct fixed counter index check in generic code
	perf/x86/intel/uncore: Correct fixed counter index check for NHM
	iwlwifi: pcie: fix race in Rx buffer allocator
	Bluetooth: hci_qca: Fix "Sleep inside atomic section" warning
	Bluetooth: btusb: Add a new Realtek 8723DE ID 2ff8:b011
	ASoC: dpcm: fix BE dai not hw_free and shutdown
	mfd: cros_ec: Fail early if we cannot identify the EC
	mwifiex: handle race during mwifiex_usb_disconnect
	wlcore: sdio: check for valid platform device data before suspend
	media: videobuf2-core: don't call memop 'finish' when queueing
	btrfs: add barriers to btrfs_sync_log before log_commit_wait wakeups
	btrfs: qgroup: Finish rescan when hit the last leaf of extent tree
	PCI: Prevent sysfs disable of device while driver is attached
	ath: Add regulatory mapping for FCC3_ETSIC
	ath: Add regulatory mapping for ETSI8_WORLD
	ath: Add regulatory mapping for APL13_WORLD
	ath: Add regulatory mapping for APL2_FCCA
	ath: Add regulatory mapping for Uganda
	ath: Add regulatory mapping for Tanzania
	ath: Add regulatory mapping for Serbia
	ath: Add regulatory mapping for Bermuda
	ath: Add regulatory mapping for Bahamas
	powerpc/32: Add a missing include header
	powerpc/chrp/time: Make some functions static, add missing header include
	powerpc/powermac: Add missing prototype for note_bootable_part()
	powerpc/powermac: Mark variable x as unused
	powerpc/8xx: fix invalid register expression in head_8xx.S
	pinctrl: at91-pio4: add missing of_node_put
	PCI: pciehp: Request control of native hotplug only if supported
	mwifiex: correct histogram data with appropriate index
	scsi: ufs: fix exception event handling
	ALSA: emu10k1: Rate-limit error messages about page errors
	regulator: pfuze100: add .is_enable() for pfuze100_swb_regulator_ops
	md: fix NULL dereference of mddev->pers in remove_and_add_spares()
	media: smiapp: fix timeout checking in smiapp_read_nvm
	ALSA: usb-audio: Apply rate limit to warning messages in URB complete callback
	HID: hid-plantronics: Re-resend Update to map button for PTT products
	drm/radeon: fix mode_valid's return type
	powerpc/embedded6xx/hlwd-pic: Prevent interrupts from being handled by Starlet
	HID: i2c-hid: check if device is there before really probing
	tty: Fix data race in tty_insert_flip_string_fixed_flag
	dma-iommu: Fix compilation when !CONFIG_IOMMU_DMA
	media: rcar_jpu: Add missing clk_disable_unprepare() on error in jpu_open()
	libata: Fix command retry decision
	media: saa7164: Fix driver name in debug output
	mtd: rawnand: fsl_ifc: fix FSL NAND driver to read all ONFI parameter pages
	brcmfmac: Add support for bcm43364 wireless chipset
	s390/cpum_sf: Add data entry sizes to sampling trailer entry
	perf: fix invalid bit in diagnostic entry
	scsi: 3w-9xxx: fix a missing-check bug
	scsi: 3w-xxxx: fix a missing-check bug
	scsi: megaraid: silence a static checker bug
	thermal: exynos: fix setting rising_threshold for Exynos5433
	bpf: fix references to free_bpf_prog_info() in comments
	media: siano: get rid of __le32/__le16 cast warnings
	drm/atomic: Handling the case when setting old crtc for plane
	ALSA: hda/ca0132: fix build failure when a local macro is defined
	memory: tegra: Do not handle spurious interrupts
	memory: tegra: Apply interrupts mask per SoC
	drm/gma500: fix psb_intel_lvds_mode_valid()'s return type
	ipconfig: Correctly initialise ic_nameservers
	rsi: Fix 'invalid vdd' warning in mmc
	audit: allow not equal op for audit by executable
	microblaze: Fix simpleImage format generation
	usb: hub: Don't wait for connect state at resume for powered-off ports
	crypto: authencesn - don't leak pointers to authenc keys
	crypto: authenc - don't leak pointers to authenc keys
	media: omap3isp: fix unbalanced dma_iommu_mapping
	scsi: scsi_dh: replace too broad "TP9" string with the exact models
	scsi: megaraid_sas: Increase timeout by 1 sec for non-RAID fastpath IOs
	media: si470x: fix __be16 annotations
	drm: Add DP PSR2 sink enable bit
	random: mix rdrand with entropy sent in from userspace
	squashfs: be more careful about metadata corruption
	ext4: fix inline data updates with checksums enabled
	ext4: check for allocation block validity with block group locked
	dmaengine: pxa_dma: remove duplicate const qualifier
	ASoC: pxa: Fix module autoload for platform drivers
	ipv4: remove BUG_ON() from fib_compute_spec_dst
	net: fix amd-xgbe flow-control issue
	net: lan78xx: fix rx handling before first packet is send
	xen-netfront: wait xenbus state change when load module manually
	NET: stmmac: align DMA stuff to largest cache line length
	tcp: do not force quickack when receiving out-of-order packets
	tcp: add max_quickacks param to tcp_incr_quickack and tcp_enter_quickack_mode
	tcp: do not aggressively quick ack after ECN events
	tcp: refactor tcp_ecn_check_ce to remove sk type cast
	tcp: add one more quick ack after after ECN events
	inet: frag: enforce memory limits earlier
	net: dsa: Do not suspend/resume closed slave_dev
	netlink: Fix spectre v1 gadget in netlink_create()
	squashfs: more metadata hardening
	squashfs: more metadata hardenings
	can: ems_usb: Fix memory leak on ems_usb_disconnect()
	net: socket: fix potential spectre v1 gadget in socketcall
	virtio_balloon: fix another race between migration and ballooning
	kvm: x86: vmx: fix vpid leak
	crypto: padlock-aes - Fix Nano workaround data corruption
	scsi: sg: fix minor memory leak in error path
	Linux 4.4.146

Change-Id: Ia7e43a90d0f5603c741811436b8de41884cb2851
Signed-off-by: Greg Kroah-Hartman <gregkh@google.com>
2018-08-06 19:12:19 +02:00
Eric Dumazet
2b30c04bc6 tcp: add max_quickacks param to tcp_incr_quickack and tcp_enter_quickack_mode
[ Upstream commit 9a9c9b51e54618861420093ae6e9b50a961914c5 ]

We want to add finer control of the number of ACK packets sent after
ECN events.

This patch is not changing current behavior, it only enables following
change.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Acked-by: Soheil Hassas Yeganeh <soheil@google.com>
Acked-by: Neal Cardwell <ncardwell@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2018-08-06 16:24:41 +02:00
Greg Kroah-Hartman
05670d3d98 This is the 4.4.145 stable release
-----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCAAdFiEEZH8oZUiU471FcZm+ONu9yGCSaT4FAltcAuIACgkQONu9yGCS
 aT632hAAgtUvSZJZTh6nMePKNys+R3XhqSbJQRqgHsWP6e8gPJ/4A5G7VmoT0STX
 0QG1+J4WscOa+0E2XznYqhEJGisS32skS8VIxfWW1ISPcx4p2tgMrSJdfjxEWiKA
 /7x39msawlcshITTjoRZjV60WzHM2MQgWa24ifOXrxxM+VlLcVSUehyMyYWfrZEt
 hJQtz6iZp3eUvbKopJnCu7iyTFo9RJciSRUmWmYg3CDROn4HJAUgV/NdgDvHmt5J
 +11WAvjQ3RdBSWy7jDadJDqy1BP2r3VdmAS1clxmjCUMsCPeHtOqNlEjc+6FhYoj
 93BNcqKpqPsN2lhuHWCHcZCWLuKA2DW+Rp3l6SvfSpxd55oQeIQEnsLnyCl9XAge
 YhGJZfSd/Ug/fvqlHyqKiv3J3ykCDnq6T4uzyxxmoeFgVq4RvMxSl0u9vMO9CG5u
 jq0Xc19ytvUUNe0ZHXSRPbgUJCEBfWIppgoXuTL4SI/E4hmyhDqUXiSiH+Hjfufc
 tnuTnSSz1CxXHct07sU/kbOTYiVHZmu/eG2Nbx+pG+d48i3/OzdW5EQ5UYvorAb3
 sOkZm5Au5VP/HTJoeW7SLGeZRI0b1SxECJOg5ENmchb8sWV9MjilUoUa408rPin9
 OYYQ7OKA3FHIvxlUCgw6RT6AUZQrwRRY7iAqnR46u26I5Ejif4k=
 =r3bN
 -----END PGP SIGNATURE-----

Merge 4.4.145 into android-4.4

Changes in 4.4.145
	MIPS: ath79: fix register address in ath79_ddr_wb_flush()
	ip: hash fragments consistently
	net/mlx4_core: Save the qpn from the input modifier in RST2INIT wrapper
	rtnetlink: add rtnl_link_state check in rtnl_configure_link
	tcp: fix dctcp delayed ACK schedule
	tcp: helpers to send special DCTCP ack
	tcp: do not cancel delay-AcK on DCTCP special ACK
	tcp: do not delay ACK in DCTCP upon CE status change
	tcp: avoid collapses in tcp_prune_queue() if possible
	tcp: detect malicious patterns in tcp_collapse_ofo_queue()
	ip: in cmsg IP(V6)_ORIGDSTADDR call pskb_may_pull
	usb: cdc_acm: Add quirk for Castles VEGA3000
	usb: core: handle hub C_PORT_OVER_CURRENT condition
	usb: gadget: f_fs: Only return delayed status when len is 0
	driver core: Partially revert "driver core: correct device's shutdown order"
	can: xilinx_can: fix RX loop if RXNEMP is asserted without RXOK
	can: xilinx_can: fix recovery from error states not being propagated
	can: xilinx_can: fix device dropping off bus on RX overrun
	can: xilinx_can: keep only 1-2 frames in TX FIFO to fix TX accounting
	can: xilinx_can: fix incorrect clear of non-processed interrupts
	can: xilinx_can: fix RX overflow interrupt not being enabled
	turn off -Wattribute-alias
	ARM: fix put_user() for gcc-8
	Linux 4.4.145

Change-Id: I449c110f7f186f2c72c9cc45e00a8deda0d54e40
Signed-off-by: Greg Kroah-Hartman <gregkh@google.com>
2018-07-31 20:19:52 +02:00
Yuchung Cheng
255924ea89 tcp: do not delay ACK in DCTCP upon CE status change
[ Upstream commit a0496ef2c23b3b180902dd185d0d63ccbc624cf8 ]

Per DCTCP RFC8257 (Section 3.2) the ACK reflecting the CE status change
has to be sent immediately so the sender can respond quickly:

""" When receiving packets, the CE codepoint MUST be processed as follows:

   1.  If the CE codepoint is set and DCTCP.CE is false, set DCTCP.CE to
       true and send an immediate ACK.

   2.  If the CE codepoint is not set and DCTCP.CE is true, set DCTCP.CE
       to false and send an immediate ACK.
"""

Previously DCTCP implementation may continue to delay the ACK. This
patch fixes that to implement the RFC by forcing an immediate ACK.

Tested with this packetdrill script provided by Larry Brakmo

0.000 socket(..., SOCK_STREAM, IPPROTO_TCP) = 3
0.000 setsockopt(3, SOL_SOCKET, SO_REUSEADDR, [1], 4) = 0
0.000 setsockopt(3, SOL_TCP, TCP_CONGESTION, "dctcp", 5) = 0
0.000 bind(3, ..., ...) = 0
0.000 listen(3, 1) = 0

0.100 < [ect0] SEW 0:0(0) win 32792 <mss 1000,sackOK,nop,nop,nop,wscale 7>
0.100 > SE. 0:0(0) ack 1 <mss 1460,nop,nop,sackOK,nop,wscale 8>
0.110 < [ect0] . 1:1(0) ack 1 win 257
0.200 accept(3, ..., ...) = 4
   +0 setsockopt(4, SOL_SOCKET, SO_DEBUG, [1], 4) = 0

0.200 < [ect0] . 1:1001(1000) ack 1 win 257
0.200 > [ect01] . 1:1(0) ack 1001

0.200 write(4, ..., 1) = 1
0.200 > [ect01] P. 1:2(1) ack 1001

0.200 < [ect0] . 1001:2001(1000) ack 2 win 257
+0.005 < [ce] . 2001:3001(1000) ack 2 win 257

+0.000 > [ect01] . 2:2(0) ack 2001
// Previously the ACK below would be delayed by 40ms
+0.000 > [ect01] E. 2:2(0) ack 3001

+0.500 < F. 9501:9501(0) ack 4 win 257

Signed-off-by: Yuchung Cheng <ycheng@google.com>
Acked-by: Neal Cardwell <ncardwell@google.com>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2018-07-28 07:45:02 +02:00
Yuchung Cheng
0b1d40e9e7 tcp: do not cancel delay-AcK on DCTCP special ACK
[ Upstream commit 27cde44a259c380a3c09066fc4b42de7dde9b1ad ]

Currently when a DCTCP receiver delays an ACK and receive a
data packet with a different CE mark from the previous one's, it
sends two immediate ACKs acking previous and latest sequences
respectly (for ECN accounting).

Previously sending the first ACK may mark off the delayed ACK timer
(tcp_event_ack_sent). This may subsequently prevent sending the
second ACK to acknowledge the latest sequence (tcp_ack_snd_check).
The culprit is that tcp_send_ack() assumes it always acknowleges
the latest sequence, which is not true for the first special ACK.

The fix is to not make the assumption in tcp_send_ack and check the
actual ack sequence before cancelling the delayed ACK. Further it's
safer to pass the ack sequence number as a local variable into
tcp_send_ack routine, instead of intercepting tp->rcv_nxt to avoid
future bugs like this.

Reported-by: Neal Cardwell <ncardwell@google.com>
Signed-off-by: Yuchung Cheng <ycheng@google.com>
Acked-by: Neal Cardwell <ncardwell@google.com>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2018-07-28 07:45:02 +02:00
Greg Kroah-Hartman
79e7124a51 This is the 4.4.123 stable release
-----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCAAdFiEEZH8oZUiU471FcZm+ONu9yGCSaT4FAlqzaAQACgkQONu9yGCS
 aT6/iw//SWWvNvIYVs9SUcAk1lLRdxIPEJ2mR4CELU8S9WblInX8MS/w3+qGUGVt
 9LzQUfxqbwpX0luel6QnyvME66qQI+VQ0CzA7Qtuzoy2UNlr5woW112nOzkBOuKt
 +KjHg3cUwPijljEZnhPq7eEv6TrHrb/ZsU4jjTLMrlL1u/TYfJQGrU+JAFpTbHto
 9IyVu/F80E/Vln2GHwYidHUbWG70lYIY75J/CXvklbfrxuRpsrOhQla52YspzI2p
 B9Q5p6Ct7FIO1SEBdVdBYLMMBbrt/+dbQzHMRWyUtTudiCSgeQ+8JEWRo6Pu+dGs
 1cW9uCJ+Q7AolQWicG/H80YBZczkB/p77PRCua2vVpx8e/FYSujP3E5T8Mnl7k81
 mmSoNRF8SqYXSwWDCqd+c/ck4zPRKPM++eNpkor+M0QyT9MOfh9oq4vIW4HoZQEm
 j+mZWUNihLO+O7znNGLoQLWn2mzDLjcrfnM6DOaKc5FJht/7OG8yyaobINYoUCmA
 +1cQP0RsRrWo87zpcWoHKRSZ6Iw+YtEu3i7q2bqEdtb2Jc/88nQ5/eYcBLYDaPtw
 DZ8GXL8kxXCP2J2GumwHIBTY68yZPVBknABGo5odqZwfkSgjK5fL17MzWX10Occp
 0U6r+IluHkRCSxcCuyS8w10+TPn/Rw26wJCstPYargfiw/twILA=
 =H2Hj
 -----END PGP SIGNATURE-----

Merge 4.4.123 into android-4.4

Changes in 4.4.123
	blkcg: fix double free of new_blkg in blkcg_init_queue
	Input: tsc2007 - check for presence and power down tsc2007 during probe
	staging: speakup: Replace BUG_ON() with WARN_ON().
	staging: wilc1000: add check for kmalloc allocation failure.
	HID: reject input outside logical range only if null state is set
	drm: qxl: Don't alloc fbdev if emulation is not supported
	ath10k: fix a warning during channel switch with multiple vaps
	PCI/MSI: Stop disabling MSI/MSI-X in pci_device_shutdown()
	selinux: check for address length in selinux_socket_bind()
	perf sort: Fix segfault with basic block 'cycles' sort dimension
	i40e: Acquire NVM lock before reads on all devices
	i40e: fix ethtool to get EEPROM data from X722 interface
	perf tools: Make perf_event__synthesize_mmap_events() scale
	drivers: net: xgene: Fix hardware checksum setting
	drm: Defer disabling the vblank IRQ until the next interrupt (for instant-off)
	ath10k: disallow DFS simulation if DFS channel is not enabled
	perf probe: Return errno when not hitting any event
	HID: clamp input to logical range if no null state
	net/8021q: create device with all possible features in wanted_features
	ARM: dts: Adjust moxart IRQ controller and flags
	batman-adv: handle race condition for claims between gateways
	of: fix of_device_get_modalias returned length when truncating buffers
	solo6x10: release vb2 buffers in solo_stop_streaming()
	scsi: ipr: Fix missed EH wakeup
	media: i2c/soc_camera: fix ov6650 sensor getting wrong clock
	timers, sched_clock: Update timeout for clock wrap
	sysrq: Reset the watchdog timers while displaying high-resolution timers
	Input: qt1070 - add OF device ID table
	sched: act_csum: don't mangle TCP and UDP GSO packets
	ASoC: rcar: ssi: don't set SSICR.CKDV = 000 with SSIWSR.CONT
	spi: omap2-mcspi: poll OMAP2_MCSPI_CHSTAT_RXS for PIO transfer
	tcp: sysctl: Fix a race to avoid unexpected 0 window from space
	dmaengine: imx-sdma: add 1ms delay to ensure SDMA channel is stopped
	driver: (adm1275) set the m,b and R coefficients correctly for power
	mm: Fix false-positive VM_BUG_ON() in page_cache_{get,add}_speculative()
	blk-throttle: make sure expire time isn't too big
	f2fs: relax node version check for victim data in gc
	bonding: refine bond_fold_stats() wrap detection
	braille-console: Fix value returned by _braille_console_setup
	drm/vmwgfx: Fixes to vmwgfx_fb
	vxlan: vxlan dev should inherit lowerdev's gso_max_size
	NFC: nfcmrvl: Include unaligned.h instead of access_ok.h
	NFC: nfcmrvl: double free on error path
	ARM: dts: r8a7790: Correct parent of SSI[0-9] clocks
	ARM: dts: r8a7791: Correct parent of SSI[0-9] clocks
	powerpc: Avoid taking a data miss on every userspace instruction miss
	net/faraday: Add missing include of of.h
	ARM: dts: koelsch: Correct clock frequency of X2 DU clock input
	reiserfs: Make cancel_old_flush() reliable
	ALSA: firewire-digi00x: handle all MIDI messages on streaming packets
	fm10k: correctly check if interface is removed
	scsi: ses: don't get power status of SES device slot on probe
	apparmor: Make path_max parameter readonly
	iommu/iova: Fix underflow bug in __alloc_and_insert_iova_range
	video: ARM CLCD: fix dma allocation size
	drm/radeon: Fail fb creation from imported dma-bufs.
	drm/amdgpu: Fail fb creation from imported dma-bufs. (v2)
	coresight: Fixes coresight DT parse to get correct output port ID.
	MIPS: BPF: Quit clobbering callee saved registers in JIT code.
	MIPS: BPF: Fix multiple problems in JIT skb access helpers.
	MIPS: r2-on-r6-emu: Fix BLEZL and BGTZL identification
	MIPS: r2-on-r6-emu: Clear BLTZALL and BGEZALL debugfs counters
	regulator: isl9305: fix array size
	md/raid6: Fix anomily when recovering a single device in RAID6.
	usb: dwc2: Make sure we disconnect the gadget state
	usb: gadget: dummy_hcd: Fix wrong power status bit clear/reset in dummy_hub_control()
	drivers/perf: arm_pmu: handle no platform_device
	perf inject: Copy events when reordering events in pipe mode
	perf session: Don't rely on evlist in pipe mode
	scsi: sg: check for valid direction before starting the request
	scsi: sg: close race condition in sg_remove_sfp_usercontext()
	kprobes/x86: Fix kprobe-booster not to boost far call instructions
	kprobes/x86: Set kprobes pages read-only
	pwm: tegra: Increase precision in PWM rate calculation
	wil6210: fix memory access violation in wil_memcpy_from/toio_32
	drm/edid: set ELD connector type in drm_edid_to_eld()
	video/hdmi: Allow "empty" HDMI infoframes
	HID: elo: clear BTN_LEFT mapping
	ARM: dts: exynos: Correct Trats2 panel reset line
	sched: Stop switched_to_rt() from sending IPIs to offline CPUs
	sched: Stop resched_cpu() from sending IPIs to offline CPUs
	test_firmware: fix setting old custom fw path back on exit
	net: xfrm: allow clearing socket xfrm policies.
	mtd: nand: fix interpretation of NAND_CMD_NONE in nand_command[_lp]()
	ARM: dts: am335x-pepper: Fix the audio CODEC's reset pin
	ARM: dts: omap3-n900: Fix the audio CODEC's reset pin
	ath10k: update tdls teardown state to target
	cpufreq: Fix governor module removal race
	clk: qcom: msm8916: fix mnd_width for codec_digcodec
	ath10k: fix invalid STS_CAP_OFFSET_MASK
	tools/usbip: fixes build with musl libc toolchain
	spi: sun6i: disable/unprepare clocks on remove
	scsi: core: scsi_get_device_flags_keyed(): Always return device flags
	scsi: devinfo: apply to HP XP the same flags as Hitachi VSP
	scsi: dh: add new rdac devices
	media: cpia2: Fix a couple off by one bugs
	veth: set peer GSO values
	drm/amdkfd: Fix memory leaks in kfd topology
	agp/intel: Flush all chipset writes after updating the GGTT
	mac80211_hwsim: enforce PS_MANUAL_POLL to be set after PS_ENABLED
	mac80211: remove BUG() when interface type is invalid
	ASoC: nuc900: Fix a loop timeout test
	ipvlan: add L2 check for packets arriving via virtual devices
	rcutorture/configinit: Fix build directory error message
	ima: relax requiring a file signature for new files with zero length
	selftests/x86/entry_from_vm86: Exit with 1 if we fail
	selftests/x86: Add tests for User-Mode Instruction Prevention
	selftests/x86: Add tests for the STR and SLDT instructions
	selftests/x86/entry_from_vm86: Add test cases for POPF
	x86/vm86/32: Fix POPF emulation
	x86/mm: Fix vmalloc_fault to use pXd_large
	ALSA: pcm: Fix UAF in snd_pcm_oss_get_formats()
	ALSA: hda - Revert power_save option default value
	ALSA: seq: Fix possible UAF in snd_seq_check_queue()
	ALSA: seq: Clear client entry before deleting else at closing
	drm/amdgpu/dce: Don't turn off DP sink when disconnected
	fs: Teach path_connected to handle nfs filesystems with multiple roots.
	lock_parent() needs to recheck if dentry got __dentry_kill'ed under it
	fs/aio: Add explicit RCU grace period when freeing kioctx
	fs/aio: Use RCU accessors for kioctx_table->table[]
	irqchip/gic-v3-its: Ensure nr_ites >= nr_lpis
	scsi: sg: fix SG_DXFER_FROM_DEV transfers
	scsi: sg: fix static checker warning in sg_is_valid_dxfer
	scsi: sg: only check for dxfer_len greater than 256M
	ARM: dts: LogicPD Torpedo: Fix I2C1 pinmux
	btrfs: alloc_chunk: fix DUP stripe size handling
	btrfs: Fix use-after-free when cleaning up fs_devs with a single stale device
	USB: gadget: udc: Add missing platform_device_put() on error in bdc_pci_probe()
	usb: gadget: bdc: 64-bit pointer capability check
	bpf: fix incorrect sign extension in check_alu_op()
	Linux 4.4.123

Change-Id: Ieb89411248f93522dde29edb8581f8ece22e33a7
Signed-off-by: Greg Kroah-Hartman <gregkh@google.com>
2018-03-22 09:57:28 +01:00
Gao Feng
37633f021b tcp: sysctl: Fix a race to avoid unexpected 0 window from space
[ Upstream commit c48367427a39ea0b85c7cf018fe4256627abfd9e ]

Because sysctl_tcp_adv_win_scale could be changed any time, so there
is one race in tcp_win_from_space.
For example,
1.sysctl_tcp_adv_win_scale<=0 (sysctl_tcp_adv_win_scale is negative now)
2.space>>(-sysctl_tcp_adv_win_scale) (sysctl_tcp_adv_win_scale is postive now)

As a result, tcp_win_from_space returns 0. It is unexpected.

Certainly if the compiler put the sysctl_tcp_adv_win_scale into one
register firstly, then use the register directly, it would be ok.
But we could not depend on the compiler behavior.

Signed-off-by: Gao Feng <fgao@ikuai8.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Sasha Levin <alexander.levin@microsoft.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2018-03-22 09:23:22 +01:00
Greg Kroah-Hartman
7eab308a49 This is the 4.4.99 stable release
-----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCAAdFiEEZH8oZUiU471FcZm+ONu9yGCSaT4FAloQBz0ACgkQONu9yGCS
 aT5nQxAAs/xWKpYLSLvLPYnTOmSmNJ36isgFriVT+wWOLzkrWTWuoQnluDjMQjie
 nH6whZMlOnG+k5GrGF3XymxZ66tDj9TlXnPAHCC8ikcqir2/dBO/gO5v2gmFgF2E
 j52mt09I3acBQJEt+Rz3xJCMa5so61uDYGtqk/URcPEW1nBa1rfA1QIy/9zv2/aw
 2yPSz4NQlv+7yvjguw4Ik5Yt/hGeu1Y8Kuc4bVHG2TB+y0QYwri42bBwQV7llili
 XqwfjFJYGMqWJqHGF/p0hD+/Xylw6GnDzxDZQDMNhsuWcfe3tUhuOVkX30E96fh2
 ipT4wI5DTmql8EN/r/P7VS2BKL4W5HEMeNEd2APkGNGnSrzKGbd0CQ+cWVZbr645
 R03AbqZjhaQKwRi+n82q1mMb4p+3Z/F/T8twHmYg/DLta3kzzRdfVJPNFBFIoSnF
 Bay8KJKqoMv2Bjhla78pHMoqSQ9j/fJc2iPIAABtFlsTjic+/STiS7ANAsmDdJtt
 8XXc6mFQfbulKKlKKqudPLjOpUNu1SrsOcc9gmovbTy7dN6FBOfJwFMCYonNyXAc
 6/ACSxYJlnZ9YEacEmcXmz0GTytyKiTYE3fNsXc/8fHnRZ1+yea9Mo77wWkj7K4V
 IqNIJMCW8K+P97oL6mdUBZwMUi4zrWueakMq8SWBdKYaD5yeV2k=
 =ql4j
 -----END PGP SIGNATURE-----

Merge 4.4.99 into android-4.4

Changes in 4.4.99
	mac80211: accept key reinstall without changing anything
	mac80211: use constant time comparison with keys
	mac80211: don't compare TKIP TX MIC key in reinstall prevention
	usb: usbtest: fix NULL pointer dereference
	Input: ims-psu - check if CDC union descriptor is sane
	ALSA: seq: Cancel pending autoload work at unbinding device
	tun/tap: sanitize TUNSETSNDBUF input
	tcp: fix tcp_mtu_probe() vs highest_sack
	l2tp: check ps->sock before running pppol2tp_session_ioctl()
	tun: call dev_get_valid_name() before register_netdevice()
	sctp: add the missing sock_owned_by_user check in sctp_icmp_redirect
	packet: avoid panic in packet_getsockopt()
	ipv6: flowlabel: do not leave opt->tot_len with garbage
	net/unix: don't show information about sockets from other namespaces
	ip6_gre: only increase err_count for some certain type icmpv6 in ip6gre_err
	tun: allow positive return values on dev_get_valid_name() call
	sctp: reset owner sk for data chunks on out queues when migrating a sock
	ppp: fix race in ppp device destruction
	ipip: only increase err_count for some certain type icmp in ipip_err
	tcp/dccp: fix ireq->opt races
	tcp/dccp: fix lockdep splat in inet_csk_route_req()
	tcp/dccp: fix other lockdep splats accessing ireq_opt
	security/keys: add CONFIG_KEYS_COMPAT to Kconfig
	tipc: fix link attribute propagation bug
	brcmfmac: remove setting IBSS mode when stopping AP
	target/iscsi: Fix iSCSI task reassignment handling
	target: Fix node_acl demo-mode + uncached dynamic shutdown regression
	misc: panel: properly restore atomic counter on error path
	Linux 4.4.99

Signed-off-by: Greg Kroah-Hartman <gregkh@google.com>
2017-11-18 17:24:24 +01:00
Eric Dumazet
71c4a0fc35 tcp: fix tcp_mtu_probe() vs highest_sack
[ Upstream commit 2b7cda9c35d3b940eb9ce74b30bbd5eb30db493d ]

Based on SNMP values provided by Roman, Yuchung made the observation
that some crashes in tcp_sacktag_walk() might be caused by MTU probing.

Looking at tcp_mtu_probe(), I found that when a new skb was placed
in front of the write queue, we were not updating tcp highest sack.

If one skb is freed because all its content was copied to the new skb
(for MTU probing), then tp->highest_sack could point to a now freed skb.

Bad things would then happen, including infinite loops.

This patch renames tcp_highest_sack_combine() and uses it
from tcp_mtu_probe() to fix the bug.

Note that I also removed one test against tp->sacked_out,
since we want to replace tp->highest_sack regardless of whatever
condition, since keeping a stale pointer to freed skb is a recipe
for disaster.

Fixes: a47e5a988a ("[TCP]: Convert highest_sack to sk_buff to allow direct access")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Reported-by: Alexei Starovoitov <alexei.starovoitov@gmail.com>
Reported-by: Roman Gushchin <guro@fb.com>
Reported-by: Oleksandr Natalenko <oleksandr@natalenko.name>
Acked-by: Alexei Starovoitov <ast@kernel.org>
Acked-by: Neal Cardwell <ncardwell@google.com>
Acked-by: Yuchung Cheng <ycheng@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2017-11-18 11:11:05 +01:00
Dmitry Shmidt
84dc474f3c This is the 4.4.34 stable release
-----BEGIN PGP SIGNATURE-----
 
 iQIcBAABCAAGBQJYMrk7AAoJEDjbvchgkmk+MuIP/ApODvom/RcmEk2cyStEbedT
 wrFgzAXF5diBNayg2sVDoNjCVRhY7dYZN1WI1ItR929gkGHDxkwk0LmRsv5K9fLK
 jflWzZc9OrChYiJKPt9x4xuvDBXJAhtT4D4czZI1ZJhfrYhq0EDkxQB23J5WCsl/
 aCVBqe57fEf5C2crsPdS8V9UmPyYgrtV74PB/EMkTLbYlPU50AuWOjUWrVPMbMTf
 lbslyN82OD/09SuYhsKLw87CAOsJPVunSOE99w+jk83j6p2nCVVKFFxvFkON6rjw
 zZHCZ657oXYM0jeoSN/y+KuP4TTlYkl3LTD6EekzWDj9GtcDCOMnWqIA128a7gpJ
 /ME7piqf1/ypfUaUyQa6eM1U0QYHsRXnJ6jnuRAJ54qQlk4FTjRLPhQbVKqejdVb
 +5vDRXL3GhFPEc5zg8x0j+sTlVj/spcqQDx71t2G9UFKHdh4IhLJuUUySeA/BeTh
 bTimCMD6oG+q+WQnKB8oNSFokbmIabvj/pGLkdAw2Iji/P6JEsMzLIHQhZVSPcUb
 oqWIj5ZJPkad6Xs1kpaJUoD1o+GUxwnzXjCZ20uxHhQ2DChpta8mFE/W6IgN3QvH
 Kj7MWpxZohu37te0XaegAZoJAimMAO9YWYSpiMIcYfN6+3xxExN90aHBlIk1NHlu
 NTaNcHKOEgeoV0E+X3oK
 =jOxx
 -----END PGP SIGNATURE-----

Merge tag 'v4.4.34' into android-4.4.y

This is the 4.4.34 stable release

Change-Id: Ic90323945584a7173f54595e0482d26fafd10174
2016-11-21 10:56:25 -08:00
Eric Dumazet
225a24ae97 tcp: take care of truncations done by sk_filter()
[ Upstream commit ac6e780070e30e4c35bd395acfe9191e6268bdd3 ]

With syzkaller help, Marco Grassi found a bug in TCP stack,
crashing in tcp_collapse()

Root cause is that sk_filter() can truncate the incoming skb,
but TCP stack was not really expecting this to happen.
It probably was expecting a simple DROP or ACCEPT behavior.

We first need to make sure no part of TCP header could be removed.
Then we need to adjust TCP_SKB_CB(skb)->end_seq

Many thanks to syzkaller team and Marco for giving us a reproducer.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Reported-by: Marco Grassi <marco.gra@gmail.com>
Reported-by: Vladis Dronov <vdronov@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2016-11-21 10:06:40 +01:00
Eric Dumazet
0f55fa7541 tcp: fix use after free in tcp_xmit_retransmit_queue()
[ Upstream commit bb1fceca22492109be12640d49f5ea5a544c6bb4 ]

When tcp_sendmsg() allocates a fresh and empty skb, it puts it at the
tail of the write queue using tcp_add_write_queue_tail()

Then it attempts to copy user data into this fresh skb.

If the copy fails, we undo the work and remove the fresh skb.

Unfortunately, this undo lacks the change done to tp->highest_sack and
we can leave a dangling pointer (to a freed skb)

Later, tcp_xmit_retransmit_queue() can dereference this pointer and
access freed memory. For regular kernels where memory is not unmapped,
this might cause SACK bugs because tcp_highest_sack_seq() is buggy,
returning garbage instead of tp->snd_nxt, but with various debug
features like CONFIG_DEBUG_PAGEALLOC, this can crash the kernel.

This bug was found by Marco Grassi thanks to syzkaller.

Fixes: 6859d49475 ("[TCP]: Abstract tp->highest_sack accessing & point to next skb")
Reported-by: Marco Grassi <marco.gra@gmail.com>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Ilpo Järvinen <ilpo.jarvinen@helsinki.fi>
Cc: Yuchung Cheng <ycheng@google.com>
Cc: Neal Cardwell <ncardwell@google.com>
Acked-by: Neal Cardwell <ncardwell@google.com>
Reviewed-by: Cong Wang <xiyou.wangcong@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Tested-by: Holger Hoffstätte <holger@applied-asynchrony.com>
2016-09-30 10:18:34 +02:00
Dmitry Shmidt
cade80573c Merge remote-tracking branch 'common/android-4.4' into android-4.4.y 2016-09-07 14:37:52 -07:00
Eric Dumazet
d25e3081ba UPSTREAM: tcp: fix use after free in tcp_xmit_retransmit_queue()
(cherry picked from commit bb1fceca22492109be12640d49f5ea5a544c6bb4)

When tcp_sendmsg() allocates a fresh and empty skb, it puts it at the
tail of the write queue using tcp_add_write_queue_tail()

Then it attempts to copy user data into this fresh skb.

If the copy fails, we undo the work and remove the fresh skb.

Unfortunately, this undo lacks the change done to tp->highest_sack and
we can leave a dangling pointer (to a freed skb)

Later, tcp_xmit_retransmit_queue() can dereference this pointer and
access freed memory. For regular kernels where memory is not unmapped,
this might cause SACK bugs because tcp_highest_sack_seq() is buggy,
returning garbage instead of tp->snd_nxt, but with various debug
features like CONFIG_DEBUG_PAGEALLOC, this can crash the kernel.

This bug was found by Marco Grassi thanks to syzkaller.

Fixes: 6859d49475 ("[TCP]: Abstract tp->highest_sack accessing & point to next skb")
Reported-by: Marco Grassi <marco.gra@gmail.com>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Ilpo Järvinen <ilpo.jarvinen@helsinki.fi>
Cc: Yuchung Cheng <ycheng@google.com>
Cc: Neal Cardwell <ncardwell@google.com>
Acked-by: Neal Cardwell <ncardwell@google.com>
Reviewed-by: Cong Wang <xiyou.wangcong@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Change-Id: I58bb02d6e4e399612e8580b9e02d11e661df82f5
Bug: 31183296
2016-09-02 13:44:25 -07:00
Dmitry Shmidt
b558f17a13 This is the 4.4.16 stable release
-----BEGIN PGP SIGNATURE-----
 Version: GnuPG v2
 
 iQIcBAABCAAGBQJXmOXmAAoJEDjbvchgkmk+QYIP/1S8oBZsvjfDzvH8t63HyLeH
 i43MFlYoFAqUIZc002XpluSvZ8uHoG+r7R8Hq3wmv48wxe3M6OBnMdBVTht6mPw+
 t5OLTZr40lWaJm2EIi4aekueMIrCgmL+Et+IFYv7ZVBuYLteVcfny+zdq4EqGmgj
 /a19+L/sTTr4SHtJIhHxWhiVJ9fVMgQk/N3VgQmIiNF2+lVbiFI7QQiDPLbFl0KK
 CM4ETO22HxHCYilGpzhpSMsHCxv12VqNaXNLAsPAepGGW7PqvUmrEWAqgwsbOfRc
 GxTLNk0dUgJqMrfEpQ8ZOMlgzvCAYG2jZuNSuT+nuzrWSUP+WOGRi9TTTxp1CYuZ
 PHlhNTH7ZnqosxJUUZS2d9N5ygpqD48Rhlfl824YzOWCy94VeUnedkVLb20uJwPF
 Y5aQ5WjktBC9why5e4OgGQERvx/U9KTk8E1zRfZZPc2oft9My0YxuemjjKAKZiYN
 ne4WhXbgOJTQkAoZwh2xqny3bWyEaoSrWpQ3R7bBJ9SIRLEOdCKzKpduDbAnbMP7
 QWgQOQC/6qA1mKqjrqF4KPA1Quo9PcUK2Ajh523ewMGCowgY90vyejAgh4Q8g0GC
 fKlx+jJDoKVDbQ8v4hc9PPHMsNNIKT9a1ptwVS3lE+bq1D5Ffm57A4/uOTMYHVab
 gKqu8h1CA0MCVBsH3nNA
 =nY8S
 -----END PGP SIGNATURE-----

Merge tag 'v4.4.16' into android-4.4.y

This is the 4.4.16 stable release

Change-Id: Ibaf7b7e03695e1acebc654a2ca1a4bfcc48fcea4
2016-08-01 15:57:55 -07:00
Dmitry Shmidt
8591a83cae Revert "net: socket ioctl to reset connections matching local address"
Use SOCK_DESTROY from now instead of SIOCKILLADDR

This reverts commit 38f0ec724f.

Change-Id: I2dcd833b66c88a48de8978dce9d72ab78f9af549
2016-04-22 11:04:14 -07:00
Eric Dumazet
2679161c77 tcp: do not drop syn_recv on all icmp reports
[ Upstream commit 9cf7490360bf2c46a16b7525f899e4970c5fc144 ]

Petr Novopashenniy reported that ICMP redirects on SYN_RECV sockets
were leading to RST.

This is of course incorrect.

A specific list of ICMP messages should be able to drop a SYN_RECV.

For instance, a REDIRECT on SYN_RECV shall be ignored, as we do
not hold a dst per SYN_RECV pseudo request.

Bugzilla: https://bugzilla.kernel.org/show_bug.cgi?id=111751
Fixes: 079096f103 ("tcp/dccp: install syn_recv requests into ehash table")
Reported-by: Petr Novopashenniy <pety@rusnet.ru>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2016-03-03 15:07:05 -08:00
Lorenzo Colitti
79170d8d5d net: diag: Support destroying TCP sockets.
This implements SOCK_DESTROY for TCP sockets. It causes all
blocking calls on the socket to fail fast with ECONNABORTED and
causes a protocol close of the socket. It informs the other end
of the connection by sending a RST, i.e., initiating a TCP ABORT
as per RFC 793. ECONNABORTED was chosen for consistency with
FreeBSD.

[cherry-pick of net-next c1e64e298b8cad309091b95d8436a0255c84f54a]

Change-Id: I728a01ef03f2ccfb9016a3f3051ef00975980e49
Signed-off-by: Lorenzo Colitti <lorenzo@google.com>
Acked-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-02-25 09:01:21 +09:00
JP Abgrall
72fc8653dc tcp: add a sysctl to config the tcp_default_init_rwnd
The default initial rwnd is hardcoded to 10.

Now we allow it to be controlled via
  /proc/sys/net/ipv4/tcp_default_init_rwnd
which limits the values from 3 to 100

This is somewhat needed because ipv6 routes are
autoconfigured by the kernel.

See "An Argument for Increasing TCP's Initial Congestion Window"
in https://developers.google.com/speed/articles/tcp_initcwnd_paper.pdf

Change-Id: I386b2a9d62de0ebe05c1ebe1b4bd91b314af5c54
Signed-off-by: JP Abgrall <jpa@google.com>

Conflicts:
	net/ipv4/sysctl_net_ipv4.c
	net/ipv4/tcp_input.c
2016-02-16 13:51:43 -08:00
Robert Love
38f0ec724f net: socket ioctl to reset connections matching local address
Introduce a new socket ioctl, SIOCKILLADDR, that nukes all sockets
bound to the same local address. This is useful in situations with
dynamic IPs, to kill stuck connections.

Signed-off-by: Brian Swetland <swetland@google.com>

net: fix tcp_v4_nuke_addr

Signed-off-by: Dima Zavin <dima@android.com>

net: ipv4: Fix a spinlock recursion bug in tcp_v4_nuke.

We can't hold the lock while calling to tcp_done(), so we drop
it before calling. We then have to start at the top of the chain again.

Signed-off-by: Dima Zavin <dima@android.com>

net: ipv4: Fix race in tcp_v4_nuke_addr().

To fix a recursive deadlock in 2.6.29, we stopped holding the hash table lock
across tcp_done() calls. This fixed the deadlock, but introduced a race where
the socket could die or change state.

Fix: Before unlocking the hash table, we grab a reference to the socket. We
can then unlock the hash table without risk of the socket going away. We then
lock the socket, which is safe because it is pinned. We can then call
tcp_done() without recursive deadlock and without race. Upon return, we unlock
the socket and then unpin it, killing it.

Change-Id: Idcdae072b48238b01bdbc8823b60310f1976e045
Signed-off-by: Robert Love <rlove@google.com>
Acked-by: Dima Zavin <dima@android.com>

ipv4: disable bottom halves around call to tcp_done().

Signed-off-by: Robert Love <rlove@google.com>
Signed-off-by: Colin Cross <ccross@android.com>

ipv4: Move sk_error_report inside bh_lock_sock in tcp_v4_nuke_addr

When sk_error_report is called, it wakes up the user-space thread, which then
calls tcp_close.  When the tcp_close is interrupted by the tcp_v4_nuke_addr
ioctl thread running tcp_done, it leaks 392 bytes and triggers a WARN_ON.

This patch moves the call to sk_error_report inside the bh_lock_sock, which
matches the locking used in tcp_v4_err.

Signed-off-by: Colin Cross <ccross@android.com>
2016-02-16 13:51:15 -08:00
Eric Dumazet
5e0724d027 tcp/dccp: fix hashdance race for passive sessions
Multiple cpus can process duplicates of incoming ACK messages
matching a SYN_RECV request socket. This is a rare event under
normal operations, but definitely can happen.

Only one must win the race, otherwise corruption would occur.

To fix this without adding new atomic ops, we use logic in
inet_ehash_nolisten() to detect the request was present in the same
ehash bucket where we try to insert the new child.

If request socket was not found, we have to undo the child creation.

This actually removes a spin_lock()/spin_unlock() pair in
reqsk_queue_unlink() for the fast path.

Fixes: e994b2f0fb ("tcp: do not lock listener to process SYN packets")
Fixes: 079096f103 ("tcp/dccp: install syn_recv requests into ehash table")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-10-23 05:42:21 -07:00
Yuchung Cheng
4f41b1c58a tcp: use RACK to detect losses
This patch implements the second half of RACK that uses the the most
recent transmit time among all delivered packets to detect losses.

tcp_rack_mark_lost() is called upon receiving a dubious ACK.
It then checks if an not-yet-sacked packet was sent at least
"reo_wnd" prior to the sent time of the most recently delivered.
If so the packet is deemed lost.

The "reo_wnd" reordering window starts with 1msec for fast loss
detection and changes to min-RTT/4 when reordering is observed.
We found 1msec accommodates well on tiny degree of reordering
(<3 pkts) on faster links. We use min-RTT instead of SRTT because
reordering is more of a path property but SRTT can be inflated by
self-inflicated congestion. The factor of 4 is borrowed from the
delayed early retransmit and seems to work reasonably well.

Since RACK is still experimental, it is now used as a supplemental
loss detection on top of existing algorithms. It is only effective
after the fast recovery starts or after the timeout occurs. The
fast recovery is still triggered by FACK and/or dupack threshold
instead of RACK.

We introduce a new sysctl net.ipv4.tcp_recovery for future
experiments of loss recoveries. For now RACK can be disabled by
setting it to 0.

Signed-off-by: Yuchung Cheng <ycheng@google.com>
Signed-off-by: Neal Cardwell <ncardwell@google.com>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-10-21 07:00:53 -07:00
Yuchung Cheng
659a8ad56f tcp: track the packet timings in RACK
This patch is the first half of the RACK loss recovery.

RACK loss recovery uses the notion of time instead
of packet sequence (FACK) or counts (dupthresh). It's inspired by the
previous FACK heuristic in tcp_mark_lost_retrans(): when a limited
transmit (new data packet) is sacked, then current retransmitted
sequence below the newly sacked sequence must been lost,
since at least one round trip time has elapsed.

But it has several limitations:
1) can't detect tail drops since it depends on limited transmit
2) is disabled upon reordering (assumes no reordering)
3) only enabled in fast recovery ut not timeout recovery

RACK (Recently ACK) addresses these limitations with the notion
of time instead: a packet P1 is lost if a later packet P2 is s/acked,
as at least one round trip has passed.

Since RACK cares about the time sequence instead of the data sequence
of packets, it can detect tail drops when later retransmission is
s/acked while FACK or dupthresh can't. For reordering RACK uses a
dynamically adjusted reordering window ("reo_wnd") to reduce false
positives on ever (small) degree of reordering.

This patch implements tcp_advanced_rack() which tracks the
most recent transmission time among the packets that have been
delivered (ACKed or SACKed) in tp->rack.mstamp. This timestamp
is the key to determine which packet has been lost.

Consider an example that the sender sends six packets:
T1: P1 (lost)
T2: P2
T3: P3
T4: P4
T100: sack of P2. rack.mstamp = T2
T101: retransmit P1
T102: sack of P2,P3,P4. rack.mstamp = T4
T205: ACK of P4 since the hole is repaired. rack.mstamp = T101

We need to be careful about spurious retransmission because it may
falsely advance tp->rack.mstamp by an RTT or an RTO, causing RACK
to falsely mark all packets lost, just like a spurious timeout.

We identify spurious retransmission by the ACK's TS echo value.
If TS option is not applicable but the retransmission is acknowledged
less than min-RTT ago, it is likely to be spurious. We refrain from
using the transmission time of these spurious retransmissions.

The second half is implemented in the next patch that marks packet
lost using RACK timestamp.

Signed-off-by: Yuchung Cheng <ycheng@google.com>
Signed-off-by: Neal Cardwell <ncardwell@google.com>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-10-21 07:00:48 -07:00
Yuchung Cheng
f672258391 tcp: track min RTT using windowed min-filter
Kathleen Nichols' algorithm for tracking the minimum RTT of a
data stream over some measurement window. It uses constant space
and constant time per update. Yet it almost always delivers
the same minimum as an implementation that has to keep all
the data in the window. The measurement window is tunable via
sysctl.net.ipv4.tcp_min_rtt_wlen with a default value of 5 minutes.

The algorithm keeps track of the best, 2nd best & 3rd best min
values, maintaining an invariant that the measurement time of
the n'th best >= n-1'th best. It also makes sure that the three
values are widely separated in the time window since that bounds
the worse case error when that data is monotonically increasing
over the window.

Upon getting a new min, we can forget everything earlier because
it has no value - the new min is less than everything else in the
window by definition and it's the most recent. So we restart fresh
on every new min and overwrites the 2nd & 3rd choices. The same
property holds for the 2nd & 3rd best.

Therefore we have to maintain two invariants to maximize the
information in the samples, one on values (1st.v <= 2nd.v <=
3rd.v) and the other on times (now-win <=1st.t <= 2nd.t <= 3rd.t <=
now). These invariants determine the structure of the code

The RTT input to the windowed filter is the minimum RTT measured
from ACK or SACK, or as the last resort from TCP timestamps.

The accessor tcp_min_rtt() returns the minimum RTT seen in the
window. ~0U indicates it is not available. The minimum is 1usec
even if the true RTT is below that.

Signed-off-by: Yuchung Cheng <ycheng@google.com>
Signed-off-by: Neal Cardwell <ncardwell@google.com>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-10-21 07:00:43 -07:00
Eric Dumazet
dc6ef6be52 tcp: do not set queue_mapping on SYNACK
At the time of commit fff3269907 ("tcp: reflect SYN queue_mapping into
SYNACK packets") we had little ways to cope with SYN floods.

We no longer need to reflect incoming skb queue mappings, and instead
can pick a TX queue based on cpu cooking the SYNACK, with normal XPS
affinities.

Note that all SYNACK retransmits were picking TX queue 0, this no longer
is a win given that SYNACK rtx are now distributed on all cpus.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-10-18 22:26:02 -07:00
Eric Dumazet
ca6fb06518 tcp: attach SYNACK messages to request sockets instead of listener
If a listen backlog is very big (to avoid syncookies), then
the listener sk->sk_wmem_alloc is the main source of false
sharing, as we need to touch it twice per SYNACK re-transmit
and TX completion.

(One SYN packet takes listener lock once, but up to 6 SYNACK
are generated)

By attaching the skb to the request socket, we remove this
source of contention.

Tested:

 listen(fd, 10485760); // single listener (no SO_REUSEPORT)
 16 RX/TX queue NIC
 Sustain a SYNFLOOD attack of ~320,000 SYN per second,
 Sending ~1,400,000 SYNACK per second.
 Perf profiles now show listener spinlock being next bottleneck.

    20.29%  [kernel]  [k] queued_spin_lock_slowpath
    10.06%  [kernel]  [k] __inet_lookup_established
     5.12%  [kernel]  [k] reqsk_timer_handler
     3.22%  [kernel]  [k] get_next_timer_interrupt
     3.00%  [kernel]  [k] tcp_make_synack
     2.77%  [kernel]  [k] ipt_do_table
     2.70%  [kernel]  [k] run_timer_softirq
     2.50%  [kernel]  [k] ip_finish_output
     2.04%  [kernel]  [k] cascade

Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-10-03 04:32:43 -07:00
Eric Dumazet
079096f103 tcp/dccp: install syn_recv requests into ehash table
In this patch, we insert request sockets into TCP/DCCP
regular ehash table (where ESTABLISHED and TIMEWAIT sockets
are) instead of using the per listener hash table.

ACK packets find SYN_RECV pseudo sockets without having
to find and lock the listener.

In nominal conditions, this halves pressure on listener lock.

Note that this will allow for SO_REUSEPORT refinements,
so that we can select a listener using cpu/numa affinities instead
of the prior 'consistent hash', since only SYN packets will
apply this selection logic.

We will shrink listen_sock in the following patch to ease
code review.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Ying Cai <ycai@google.com>
Cc: Willem de Bruijn <willemb@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-10-03 04:32:41 -07:00
Eric Dumazet
aa3a0c8ce6 tcp: get_openreq[46]() changes
When request sockets are no longer in a per listener hash table
but on regular TCP ehash, we need to access listener uid
through req->rsk_listener

get_openreq6() also gets a const for its request socket argument.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-10-03 04:32:40 -07:00
Eric Dumazet
f964629e33 tcp: constify tcp_v{4|6}_route_req() sock argument
These functions do not change the listener socket.
Goal is to make sure tcp_conn_request() is not messing with
listener in a racy way.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-09-29 16:53:09 -07:00
Eric Dumazet
3f684b4b1f tcp: cookie_init_sequence() cleanups
Some common IPv4/IPv6 code can be factorized.
Also constify cookie_init_sequence() socket argument.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-09-29 16:53:09 -07:00
Eric Dumazet
0c27171e66 tcp/dccp: constify syn_recv_sock() method sock argument
We'll soon no longer hold listener socket lock, these
functions do not modify the socket in any way.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-09-29 16:53:09 -07:00
Eric Dumazet
c28c6f0459 tcp: constify tcp_create_openreq_child() socket argument
This method does not touch the listener socket.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-09-29 16:53:09 -07:00
Eric Dumazet
72ab4a86f7 tcp: remove tcp_rcv_state_process() tcp_hdr argument
Factorize code to get tcp header from skb. It makes no sense
to duplicate code in callers.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-09-29 16:53:07 -07:00
Eric Dumazet
bda07a64c0 tcp: remove unused len argument from tcp_rcv_state_process()
Once we realize tcp_rcv_synsent_state_process() does not use
its 'len' argument and we get rid of it, then it becomes clear
this argument is no longer used in tcp_rcv_state_process()

Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-09-29 16:53:07 -07:00
Eric Dumazet
7c85af8810 tcp: avoid reorders for TFO passive connections
We found that a TCP Fast Open passive connection was vulnerable
to reorders, as the exchange might look like

[1] C -> S S <FO ...> <request>
[2] S -> C S. ack request <options>
[3] S -> C . <answer>

packets [2] and [3] can be generated at almost the same time.

If C receives the 3rd packet before the 2nd, it will drop it as
the socket is in SYN_SENT state and expects a SYNACK.

S will have to retransmit the answer.

Current OOO avoidance in linux is defeated because SYNACK
packets are attached to the LISTEN socket, while DATA packets
are attached to the children. They might be sent by different cpus,
and different TX queues might be selected.

It turns out that for TFO, we created a child, which is a
full blown socket in TCP_SYN_RECV state, and we simply can attach
the SYNACK packet to this socket.

This means that at the time tcp_sendmsg() pushes DATA packet,
skb->ooo_okay will be set iff the SYNACK packet had been sent
and TX completed.

This removes the reorder source at the host level.

We also removed the export of tcp_try_fastopen(), as it is no
longer called from IPv6.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: Yuchung Cheng <ycheng@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-09-28 22:11:19 -07:00
Eric Dumazet
ea3bea3a1d tcp/dccp: constify rtx_synack() and friends
This is done to make sure we do not change listener socket
while sending SYNACK packets while socket lock is not held.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-09-25 13:00:39 -07:00
Eric Dumazet
0f935dbedc tcp: constify tcp_v{4|6}_send_synack() socket argument
This documents fact that listener lock might not be held
at the time SYNACK are sent.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-09-25 13:00:39 -07:00
Eric Dumazet
5d062de7f8 tcp: constify tcp_make_synack() socket argument
listener socket is not locked when tcp_make_synack() is called.

We better make sure no field is written.

There is one exception : Since SYNACK packets are attached to the listener
at this moment (or SYN_RECV child in case of Fast Open),
sock_wmalloc() needs to update sk->sk_wmem_alloc, but this is done using
atomic operations so this is safe.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-09-25 13:00:38 -07:00
Eric Dumazet
b83e3deb97 tcp: md5: constify tcp_md5_do_lookup() socket argument
When TCP new listener is done, these functions will be called
without socket lock being held. Make sure they don't change
anything.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-09-25 13:00:38 -07:00
Eric Dumazet
b1964b5fce tcp: constify tcp_openreq_init_rwin()
Soon, listener socket wont be locked when tcp_openreq_init_rwin()
is called. We need to read socket fields once, as their value
could change under us.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-09-25 13:00:36 -07:00
Eric Dumazet
b40cf18ef7 tcp: constify listener socket in tcp_v[46]_init_req()
Soon, listener socket spinlock will no longer be held,
add const arguments to tcp_v[46]_init_req() to make clear these
functions can not mess socket fields.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-09-25 13:00:36 -07:00
Yuchung Cheng
0f1c28ae74 tcp: usec resolution SYN/ACK RTT
Currently SYN/ACK RTT is measured in jiffies. For LAN the SYN/ACK
RTT is often measured as 0ms or sometimes 1ms, which would affect
RTT estimation and min RTT samping used by some congestion control.

This patch improves SYN/ACK RTT to be usec resolution if platform
supports it. While the timestamping of SYN/ACK is done in request
sock, the RTT measurement is carefully arranged to avoid storing
another u64 timestamp in tcp_sock.

For regular handshake w/o SYNACK retransmission, the RTT is sampled
right after the child socket is created and right before the request
sock is released (tcp_check_req() in tcp_minisocks.c)

For Fast Open the child socket is already created when SYN/ACK was
sent, the RTT is sampled in tcp_rcv_state_process() after processing
the final ACK an right before the request socket is released.

If the SYN/ACK was retransmistted or SYN-cookie was used, we rely
on TCP timestamps to measure the RTT. The sample is taken at the
same place in tcp_rcv_state_process() after the timestamp values
are validated in tcp_validate_incoming(). Note that we do not store
TS echo value in request_sock for SYN-cookies, because the value
is already stored in tp->rx_opt used by tcp_ack_update_rtt().

One side benefit is that the RTT measurement now happens before
initializing congestion control (of the passive side). Therefore
the congestion control can use the SYN/ACK RTT.

Signed-off-by: Yuchung Cheng <ycheng@google.com>
Signed-off-by: Neal Cardwell <ncardwell@google.com>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-09-21 16:19:01 -07:00
Daniel Borkmann
c3a8d94746 tcp: use dctcp if enabled on the route to the initiator
Currently, the following case doesn't use DCTCP, even if it should:
A responder has f.e. Cubic as system wide default, but for a specific
route to the initiating host, DCTCP is being set in RTAX_CC_ALGO. The
initiating host then uses DCTCP as congestion control, but since the
initiator sets ECT(0), tcp_ecn_create_request() doesn't set ecn_ok,
and we have to fall back to Reno after 3WHS completes.

We were thinking on how to solve this in a minimal, non-intrusive
way without bloating tcp_ecn_create_request() needlessly: lets cache
the CA ecn option flag in RTAX_FEATURES. In other words, when ECT(0)
is set on the SYN packet, set ecn_ok=1 iff route RTAX_FEATURES
contains the unexposed (internal-only) DST_FEATURE_ECN_CA. This allows
to only do a single metric feature lookup inside tcp_ecn_create_request().

Joint work with Florian Westphal.

Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-08-31 12:34:00 -07:00
Eric Dumazet
43e122b014 tcp: refine pacing rate determination
When TCP pacing was added back in linux-3.12, we chose
to apply a fixed ratio of 200 % against current rate,
to allow probing for optimal throughput even during
slow start phase, where cwnd can be doubled every other gRTT.

At Google, we found it was better applying a different ratio
while in Congestion Avoidance phase.
This ratio was set to 120 %.

We've used the normal tcp_in_slow_start() helper for a while,
then tuned the condition to select the conservative ratio
as soon as cwnd >= ssthresh/2 :

- After cwnd reduction, it is safer to ramp up more slowly,
  as we approach optimal cwnd.
- Initial ramp up (ssthresh == INFINITY) still allows doubling
  cwnd every other RTT.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Neal Cardwell <ncardwell@google.com>
Cc: Yuchung Cheng <ycheng@google.com>
Acked-by: Neal Cardwell <ncardwell@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-08-25 11:33:54 -07:00
Eric Dumazet
6f021c62d6 tcp: fix slow start after idle vs TSO/GSO
slow start after idle might reduce cwnd, but we perform this
after first packet was cooked and sent.

With TSO/GSO, it means that we might send a full TSO packet
even if cwnd should have been reduced to IW10.

Moving the SSAI check in skb_entail() makes sense, because
we slightly reduce number of times this check is done,
especially for large send() and TCP Small queue callbacks from
softirq context.

As Neal pointed out, we also need to perform the check
if/when receive window opens.

Tested:

Following packetdrill test demonstrates the problem
// Test of slow start after idle

`sysctl -q net.ipv4.tcp_slow_start_after_idle=1`

0.000 socket(..., SOCK_STREAM, IPPROTO_TCP) = 3
+0    setsockopt(3, SOL_SOCKET, SO_REUSEADDR, [1], 4) = 0
+0    bind(3, ..., ...) = 0
+0    listen(3, 1) = 0

+0    < S 0:0(0) win 65535 <mss 1000,sackOK,nop,nop,nop,wscale 7>
+0    > S. 0:0(0) ack 1 <mss 1460,nop,nop,sackOK,nop,wscale 6>
+.100 < . 1:1(0) ack 1 win 511
+0    accept(3, ..., ...) = 4
+0    setsockopt(4, SOL_SOCKET, SO_SNDBUF, [200000], 4) = 0

+0    write(4, ..., 26000) = 26000
+0    > . 1:5001(5000) ack 1
+0    > . 5001:10001(5000) ack 1
+0    %{ assert tcpi_snd_cwnd == 10 }%

+.100 < . 1:1(0) ack 10001 win 511
+0    %{ assert tcpi_snd_cwnd == 20, tcpi_snd_cwnd }%
+0    > . 10001:20001(10000) ack 1
+0    > P. 20001:26001(6000) ack 1

+.100 < . 1:1(0) ack 26001 win 511
+0    %{ assert tcpi_snd_cwnd == 36, tcpi_snd_cwnd }%

+4 write(4, ..., 20000) = 20000
// If slow start after idle works properly, we should send 5 MSS here (cwnd/2)
+0    > . 26001:31001(5000) ack 1
+0    %{ assert tcpi_snd_cwnd == 10, tcpi_snd_cwnd }%
+0    > . 31001:36001(5000) ack 1

Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Neal Cardwell <ncardwell@google.com>
Cc: Yuchung Cheng <ycheng@google.com>
Acked-by: Neal Cardwell <ncardwell@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-08-25 11:22:50 -07:00
Yuchung Cheng
76174004a0 tcp: do not slow start when cwnd equals ssthresh
In the original design slow start is only used to raise cwnd
when cwnd is stricly below ssthresh. It makes little sense
to slow start when cwnd == ssthresh: especially
when hystart has set ssthresh in the initial ramp, or after
recovery when cwnd resets to ssthresh. Not doing so will
also help reduce the buffer bloat slightly.

Signed-off-by: Yuchung Cheng <ycheng@google.com>
Signed-off-by: Neal Cardwell <ncardwell@google.com>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: Nandita Dukkipati <nanditad@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-07-09 14:22:52 -07:00
Yuchung Cheng
071d5080e3 tcp: add tcp_in_slow_start helper
Add a helper to test the slow start condition in various congestion
control modules and other places. This is to prepare a slight improvement
in policy as to exactly when to slow start.

Signed-off-by: Yuchung Cheng <ycheng@google.com>
Signed-off-by: Neal Cardwell <ncardwell@google.com>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: Nandita Dukkipati <nanditad@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-07-09 14:22:52 -07:00
Eric Dumazet
f69ad292cf tcp: fill shinfo->gso_size at last moment
In commit cd7d8498c9 ("tcp: change tcp_skb_pcount() location") we stored
gso_segs in a temporary cache hot location.

This patch does the same for gso_size.

This allows to save 2 cache line misses in tcp xmit path for
the last packet that is considered but not sent because of
various conditions (cwnd, tso defer, receiver window, TSQ...)

Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-06-11 16:33:11 -07:00