Commit graph

702 commits

Author SHA1 Message Date
Greg Kroah-Hartman
7a77ef209c Fix backport of "tcp: detect malicious patterns in tcp_collapse_ofo_queue()"
Based on review from Eric Dumazet, my backport of commit
3d4bf93ac12003f9b8e1e2de37fe27983deebdcf to older kernels was a bit
incorrect.  This patch fixes this.

Reported-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@google.com>
2018-07-27 14:02:33 +02:00
Eric Dumazet
792e682a47 tcp: detect malicious patterns in tcp_collapse_ofo_queue()
[ Upstream commit 3d4bf93ac12003f9b8e1e2de37fe27983deebdcf ]

In case an attacker feeds tiny packets completely out of order,
tcp_collapse_ofo_queue() might scan the whole rb-tree, performing
expensive copies, but not changing socket memory usage at all.

1) Do not attempt to collapse tiny skbs.
2) Add logic to exit early when too many tiny skbs are detected.

We prefer not doing aggressive collapsing (which copies packets)
for pathological flows, and revert to tcp_prune_ofo_queue() which
will be less expensive.

In the future, we might add the possibility of terminating flows
that are proven to be malicious.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Acked-by: Soheil Hassas Yeganeh <soheil@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@google.com>
2018-07-27 12:33:59 +02:00
Eric Dumazet
9fa2a49a4a tcp: avoid collapses in tcp_prune_queue() if possible
[ Upstream commit f4a3313d8e2ca9fd8d8f45e40a2903ba782607e7 ]

Right after a TCP flow is created, receiving tiny out of order
packets allways hit the condition :

if (atomic_read(&sk->sk_rmem_alloc) >= sk->sk_rcvbuf)
	tcp_clamp_window(sk);

tcp_clamp_window() increases sk_rcvbuf to match sk_rmem_alloc
(guarded by tcp_rmem[2])

Calling tcp_collapse_ofo_queue() in this case is not useful,
and offers a O(N^2) surface attack to malicious peers.

Better not attempt anything before full queue capacity is reached,
forcing attacker to spend lots of resource and allow us to more
easily detect the abuse.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Acked-by: Soheil Hassas Yeganeh <soheil@google.com>
Acked-by: Yuchung Cheng <ycheng@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@google.com>
2018-07-27 12:33:58 +02:00
Greg Kroah-Hartman
7ba5557097 This is the 4.4.139 stable release
-----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCAAdFiEEZH8oZUiU471FcZm+ONu9yGCSaT4FAls7QB8ACgkQONu9yGCS
 aT6trQ/9EO1dgc0lZO0zGCxFFiikPzzMp1auSKd99FhSaqlrCPutT5K0gBVc1rug
 EvggbqWj2MBX2HZvxQR8LbGNvp7+kkM3apIdYOqyTQPvs7x03YNeuvXZUF3EFyPO
 eDZ71nLuwgnEeySceJ+Z9HcVBcWR/0dEkwjhjpIJ2IO25tcecWzbqOOdzNypBIKK
 EG4dGhO5JY6jLqxbEFZ9d302bGZQozOQHiDfEZz6NueI0yYVJIjQQvuLp/V0ChDg
 TN+PgTOdzxIPCpZw9y4XzN4nhdsOial1xeX7agzAkZDjdbprNpbZrxjfY0NLdpQ0
 4ZV3vLqIZ5rs8xuCRgNJ7yTVt6X7miw/h7TQp30qpeDuRf1SHZa4ITqMzdXJUahW
 BT+XkjrrCjKxXkCH+rWy0txtouUaVwM+sKHIW0bvrOJwHM0UJXNAUppt4NrBtgtD
 7Zt/FDKAHCJk1GuW3U5zXOHmgn+QkRNEndpwbUjwRowvHcE5jVSLLkH4XZkA0+SL
 ucQCxOqGKrbHjhyXT+e2Kpx4Z5sqJIUHhc4iw6gi7xyaoJ55kHZ2S+sCwo3cjreq
 B43SrwkQ0EJXwHzcrmvDfnvEFf7ylDVWH597lQsIQMNI7Gg04fXixYpvr6DYOBSN
 AKHvoqd7VztHnX/ZogyLXp4jWiU5dU6qYXdj/zEs+tB8DYPZ4+c=
 =Mli0
 -----END PGP SIGNATURE-----

Merge 4.4.139 into android-4.4

Changes in 4.4.139
	xfrm6: avoid potential infinite loop in _decode_session6()
	netfilter: ebtables: handle string from userspace with care
	ipvs: fix buffer overflow with sync daemon and service
	atm: zatm: fix memcmp casting
	net: qmi_wwan: Add Netgear Aircard 779S
	net/sonic: Use dma_mapping_error()
	Revert "Btrfs: fix scrub to repair raid6 corruption"
	tcp: do not overshoot window_clamp in tcp_rcv_space_adjust()
	Btrfs: make raid6 rebuild retry more
	usb: musb: fix remote wakeup racing with suspend
	bonding: re-evaluate force_primary when the primary slave name changes
	tcp: verify the checksum of the first data segment in a new connection
	ext4: update mtime in ext4_punch_hole even if no blocks are released
	ext4: fix fencepost error in check for inode count overflow during resize
	driver core: Don't ignore class_dir_create_and_add() failure.
	btrfs: scrub: Don't use inode pages for device replace
	ALSA: hda - Handle kzalloc() failure in snd_hda_attach_pcm_stream()
	ALSA: hda: add dock and led support for HP EliteBook 830 G5
	ALSA: hda: add dock and led support for HP ProBook 640 G4
	cpufreq: Fix new policy initialization during limits updates via sysfs
	libata: zpodd: make arrays cdb static, reduces object code size
	libata: zpodd: small read overflow in eject_tray()
	libata: Drop SanDisk SD7UB3Q*G1001 NOLPM quirk
	w1: mxc_w1: Enable clock before calling clk_get_rate() on it
	fs/binfmt_misc.c: do not allow offset overflow
	x86/spectre_v1: Disable compiler optimizations over array_index_mask_nospec()
	m68k/mm: Adjust VM area to be unmapped by gap size for __iounmap()
	serial: sh-sci: Use spin_{try}lock_irqsave instead of open coding version
	signal/xtensa: Consistenly use SIGBUS in do_unaligned_user
	usb: do not reset if a low-speed or full-speed device timed out
	1wire: family module autoload fails because of upper/lower case mismatch.
	ASoC: dapm: delete dapm_kcontrol_data paths list before freeing it
	ASoC: cirrus: i2s: Fix LRCLK configuration
	ASoC: cirrus: i2s: Fix {TX|RX}LinCtrlData setup
	lib/vsprintf: Remove atomic-unsafe support for %pCr
	mips: ftrace: fix static function graph tracing
	branch-check: fix long->int truncation when profiling branches
	ipmi:bt: Set the timeout before doing a capabilities check
	Bluetooth: hci_qca: Avoid missing rampatch failure with userspace fw loader
	fuse: atomic_o_trunc should truncate pagecache
	fuse: don't keep dead fuse_conn at fuse_fill_super().
	fuse: fix control dir setup and teardown
	powerpc/mm/hash: Add missing isync prior to kernel stack SLB switch
	powerpc/ptrace: Fix setting 512B aligned breakpoints with PTRACE_SET_DEBUGREG
	powerpc/ptrace: Fix enforcement of DAWR constraints
	cpuidle: powernv: Fix promotion from snooze if next state disabled
	powerpc/fadump: Unregister fadump on kexec down path.
	ARM: 8764/1: kgdb: fix NUMREGBYTES so that gdb_regs[] is the correct size
	of: unittest: for strings, account for trailing \0 in property length field
	IB/qib: Fix DMA api warning with debug kernel
	RDMA/mlx4: Discard unknown SQP work requests
	mtd: cfi_cmdset_0002: Change write buffer to check correct value
	mtd: cfi_cmdset_0002: Use right chip in do_ppb_xxlock()
	mtd: cfi_cmdset_0002: fix SEGV unlocking multiple chips
	mtd: cfi_cmdset_0002: Fix unlocking requests crossing a chip boudary
	mtd: cfi_cmdset_0002: Avoid walking all chips when unlocking.
	MIPS: BCM47XX: Enable 74K Core ExternalSync for PCIe erratum
	PCI: pciehp: Clear Presence Detect and Data Link Layer Status Changed on resume
	MIPS: io: Add barrier after register read in inX()
	time: Make sure jiffies_to_msecs() preserves non-zero time periods
	Btrfs: fix clone vs chattr NODATASUM race
	iio:buffer: make length types match kfifo types
	scsi: qla2xxx: Fix setting lower transfer speed if GPSC fails
	scsi: zfcp: fix missing SCSI trace for result of eh_host_reset_handler
	scsi: zfcp: fix missing SCSI trace for retry of abort / scsi_eh TMF
	scsi: zfcp: fix misleading REC trigger trace where erp_action setup failed
	scsi: zfcp: fix missing REC trigger trace on terminate_rport_io early return
	scsi: zfcp: fix missing REC trigger trace on terminate_rport_io for ERP_FAILED
	scsi: zfcp: fix missing REC trigger trace for all objects in ERP_FAILED
	scsi: zfcp: fix missing REC trigger trace on enqueue without ERP thread
	linvdimm, pmem: Preserve read-only setting for pmem devices
	md: fix two problems with setting the "re-add" device state.
	ubi: fastmap: Cancel work upon detach
	UBIFS: Fix potential integer overflow in allocation
	xfrm: Ignore socket policies when rebuilding hash tables
	xfrm: skip policies marked as dead while rehashing
	backlight: as3711_bl: Fix Device Tree node lookup
	backlight: max8925_bl: Fix Device Tree node lookup
	backlight: tps65217_bl: Fix Device Tree node lookup
	mfd: intel-lpss: Program REMAP register in PIO mode
	perf tools: Fix symbol and object code resolution for vdso32 and vdsox32
	perf intel-pt: Fix sync_switch INTEL_PT_SS_NOT_TRACING
	perf intel-pt: Fix decoding to accept CBR between FUP and corresponding TIP
	perf intel-pt: Fix MTC timing after overflow
	perf intel-pt: Fix "Unexpected indirect branch" error
	perf intel-pt: Fix packet decoding of CYC packets
	media: v4l2-compat-ioctl32: prevent go past max size
	media: cx231xx: Add support for AverMedia DVD EZMaker 7
	media: dvb_frontend: fix locking issues at dvb_frontend_get_event()
	nfsd: restrict rd_maxcount to svc_max_payload in nfsd_encode_readdir
	NFSv4: Fix possible 1-byte stack overflow in nfs_idmap_read_and_verify_message
	video: uvesafb: Fix integer overflow in allocation
	Input: elan_i2c - add ELAN0618 (Lenovo v330 15IKB) ACPI ID
	xen: Remove unnecessary BUG_ON from __unbind_from_irq()
	udf: Detect incorrect directory size
	Input: elan_i2c_smbus - fix more potential stack buffer overflows
	Input: elantech - enable middle button of touchpads on ThinkPad P52
	Input: elantech - fix V4 report decoding for module with middle key
	ALSA: hda/realtek - Add a quirk for FSC ESPRIMO U9210
	Btrfs: fix unexpected cow in run_delalloc_nocow
	spi: Fix scatterlist elements size in spi_map_buf
	block: Fix transfer when chunk sectors exceeds max
	dm thin: handle running out of data space vs concurrent discard
	cdc_ncm: avoid padding beyond end of skb
	Bluetooth: Fix connection if directed advertising and privacy is used
	Linux 4.4.139

Change-Id: I93013bedf2ebe3e6a8718972d8854723609963cc
Signed-off-by: Greg Kroah-Hartman <gregkh@google.com>
2018-07-03 18:23:34 +02:00
Eric Dumazet
4dff97920e tcp: do not overshoot window_clamp in tcp_rcv_space_adjust()
commit 02db55718d53f9d426cee504c27fb768e9ed4ffe upstream.

While rcvbuf is properly clamped by tcp_rmem[2], rcvwin
is left to a potentially too big value.

It has no serious effect, since :
1) tcp_grow_window() has very strict checks.
2) window_clamp can be mangled by user space to any value anyway.

tcp_init_buffer_space() and companions use tcp_full_space(),
we use tcp_win_from_space() to avoid reloading sk->sk_rcvbuf

Signed-off-by: Eric Dumazet <edumazet@google.com>
Acked-by: Soheil Hassas Yeganeh <soheil@google.com>
Acked-by: Wei Wang <weiwan@google.com>
Acked-by: Neal Cardwell <ncardwell@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Cc: Benjamin Gilbert <benjamin.gilbert@coreos.com>
Signed-off-by: Guenter Roeck <linux@roeck-us.net>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2018-07-03 11:21:24 +02:00
Greg Kroah-Hartman
fb7e319634 This is the 4.4.136 stable release
-----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCAAdFiEEZH8oZUiU471FcZm+ONu9yGCSaT4FAlsX88AACgkQONu9yGCS
 aT4fEBAAygf8GZqR8ql76DdEBREkgTgGrne2+Rq56eylWZyycU2FpZVLe2ct7yjf
 rbF2XCtxdPmia++z0WvmslDbtUeqSSPOz1jZBEERmyZpjpOkDTwsMUfz75Gvpi83
 ZJS4KXseL9W/jrSyIAbHJ4Fq1ffmoWzN8mEepde26Ic2DJ/3mB2Dphgg95UjI7rw
 KGg3+Jjr21ojrEmI1BOVItgZ6iU0jTgCkwrYrP1eI+OzRjasGMMJRh/HYBfr3GEY
 N6Ggi5PyIWF/DOeTp53hajOAFbt5WTFK6hiiwLqz+6XQuhY45N1YuXgT/vszZmKz
 nngD5p5+GWKZoXtRXoLMXts8EdZ55yoyj6dkIOM5W62C3HhxjqpPrLXJMdtm5eO/
 tL8/vbB6AzniFB/hQS4IqfqQ6sizcAzGi/vP0eOW2I7K9WIsbXR9vt1BcvVaIrRF
 O/9xX4QJrceNIUzq25sdS7vv4fk7O0AUq/bZtYWWjKY+4E2LhAPoHgmB7cF/M8jJ
 K8BtMtClyDqfpIhJiH3PDYdY6jRfYKcNUhMZLBYN9uRwa/5l8cC4AIKBEY8IyhgB
 i05G8YadInSSqf2eRGZ97Qpn5MVYm2G/r2BtpNLbCfIYUfvnHD7mWfteVjVw4Yjh
 Q6ERVHkvjEFsn1BPBd34OMVJlDz0oqNT92NwiAlXiA4Sxizvvh4=
 =0oNX
 -----END PGP SIGNATURE-----

Merge 4.4.136 into android-4.4

Changes in 4.4.136
	arm64: lse: Add early clobbers to some input/output asm operands
	powerpc/64s: Clear PCR on boot
	USB: serial: cp210x: use tcflag_t to fix incompatible pointer type
	sh: New gcc support
	xfs: detect agfl count corruption and reset agfl
	Revert "ima: limit file hash setting by user to fix and log modes"
	Input: elan_i2c_smbus - fix corrupted stack
	tracing: Fix crash when freeing instances with event triggers
	selinux: KASAN: slab-out-of-bounds in xattr_getsecurity
	cfg80211: further limit wiphy names to 64 bytes
	rtlwifi: rtl8192cu: Remove variable self-assignment in rf.c
	ASoC: Intel: sst: remove redundant variable dma_dev_name
	irda: fix overly long udelay()
	tcp: avoid integer overflows in tcp_rcv_space_adjust()
	i2c: rcar: make sure clocks are on when doing clock calculation
	i2c: rcar: rework hw init
	i2c: rcar: remove unused IOERROR state
	i2c: rcar: remove spinlock
	i2c: rcar: refactor setup of a msg
	i2c: rcar: init new messages in irq
	i2c: rcar: don't issue stop when HW does it automatically
	i2c: rcar: check master irqs before slave irqs
	i2c: rcar: revoke START request early
	dmaengine: usb-dmac: fix endless loop in usb_dmac_chan_terminate_all()
	iio:kfifo_buf: check for uint overflow
	MIPS: ptrace: Fix PTRACE_PEEKUSR requests for 64-bit FGRs
	MIPS: prctl: Disallow FRE without FR with PR_SET_FP_MODE requests
	scsi: scsi_transport_srp: Fix shost to rport translation
	stm class: Use vmalloc for the master map
	hwtracing: stm: fix build error on some arches
	drm/i915: Disable LVDS on Radiant P845
	Kbuild: change CC_OPTIMIZE_FOR_SIZE definition
	fix io_destroy()/aio_complete() race
	mm: fix the NULL mapping case in __isolate_lru_page()
	sparc64: Fix build warnings with gcc 7.
	Linux 4.4.136

Change-Id: I3457f995cf22c65952271ecd517a46144ac4dc79
Signed-off-by: Greg Kroah-Hartman <gregkh@google.com>
2018-06-06 18:53:06 +02:00
Eric Dumazet
70741861fc tcp: avoid integer overflows in tcp_rcv_space_adjust()
commit 607065bad9931e72207b0cac365d7d4abc06bd99 upstream.

When using large tcp_rmem[2] values (I did tests with 500 MB),
I noticed overflows while computing rcvwin.

Lets fix this before the following patch.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Acked-by: Soheil Hassas Yeganeh <soheil@google.com>
Acked-by: Wei Wang <weiwan@google.com>
Acked-by: Neal Cardwell <ncardwell@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
[Backport: sysctl_tcp_rmem is not Namespace-ify'd in older kernels]
Signed-off-by: Guenter Roeck <linux@roeck-us.net>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2018-06-06 16:46:21 +02:00
Greg Kroah-Hartman
12ef385f51 This is the 4.4.130 stable release
-----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCAAdFiEEZH8oZUiU471FcZm+ONu9yGCSaT4FAlrlXREACgkQONu9yGCS
 aT6hkA//dewP2AXQLE4Nb3ufxBOo2CcDOKRrMQTgwnaqe06FnXcdty8jZ6I6ziBV
 MtpYmyTY4Ujj062LPjwov6CEI3qS5aNXFd9Xqtq490FQBeKBA56MgZ7qOEb/aHDQ
 4e7D9hIFp+gu2J41toWMJO/BZ1qcMKOEo9e49NNQG+4JjNz6fhdQpdsGC8QqFauQ
 aT6l+cHhU+yxuBdj/Nzu5YnQrE4Ijk3omOsLrGR6hUCjG0pSOX2YK7XF+W+VcWd9
 7Uky54yPC/DuDLTLCWdZJxD/2WzrE0vvKhfkhty4X0k9ZXqnGbSr1x/FSKLrnXNi
 aigN2OBvlHI2RBwl8J06QOfmG32Ndi32nNj7f5+45VXQkiVK7OS3B0CgkYh4m+7J
 dEZVvH8Ddj4p6Io8WqWWRP0UDRNPP+reUwlIIg19dSKMv0veez1g0AFO7AbWh2Lu
 q9QRxtKTeYR7We8w3F2CeLHjP1NLhKvdffT8TVjhEL6MekWkOjFjQXGoRIOPpeld
 T0buhSQMgoEuJXMPz9zywJ2MmgKcRtmQKGKfZ7MDa82kwiIY5KRemTVnwn1lEKXU
 EC+kCQmjiobcuGbBbH1hXfWcSpDl/+ZwI5r0L2Lzkzf0R+Efge7rGBXZQCgz3FGp
 DVtpPPnRnXbGZX+0GyeiZ5O9wCfZTQxqOrMk2rtPbxJY5f6yIcE=
 =IA5i
 -----END PGP SIGNATURE-----

Merge 4.4.130 into android-4.4

Changes in 4.4.130
	cifs: do not allow creating sockets except with SMB1 posix exensions
	x86/tsc: Prevent 32bit truncation in calc_hpet_ref()
	perf: Return proper values for user stack errors
	staging: ion : Donnot wakeup kswapd in ion system alloc
	r8152: add Linksys USB3GIGV1 id
	Input: drv260x - fix initializing overdrive voltage
	ath9k_hw: check if the chip failed to wake up
	jbd2: fix use after free in kjournald2()
	Revert "ath10k: send (re)assoc peer command when NSS changed"
	s390: introduce CPU alternatives
	s390: enable CPU alternatives unconditionally
	KVM: s390: wire up bpb feature
	s390: scrub registers on kernel entry and KVM exit
	s390: add optimized array_index_mask_nospec
	s390/alternative: use a copy of the facility bit mask
	s390: add options to change branch prediction behaviour for the kernel
	s390: run user space and KVM guests with modified branch prediction
	s390: introduce execute-trampolines for branches
	s390: Replace IS_ENABLED(EXPOLINE_*) with IS_ENABLED(CONFIG_EXPOLINE_*)
	s390: do not bypass BPENTER for interrupt system calls
	s390/entry.S: fix spurious zeroing of r0
	s390: move nobp parameter functions to nospec-branch.c
	s390: add automatic detection of the spectre defense
	s390: report spectre mitigation via syslog
	s390: add sysfs attributes for spectre
	s390: correct nospec auto detection init order
	s390: correct module section names for expoline code revert
	bonding: do not set slave_dev npinfo before slave_enable_netpoll in bond_enslave
	KEYS: DNS: limit the length of option strings
	l2tp: check sockaddr length in pppol2tp_connect()
	net: validate attribute sizes in neigh_dump_table()
	llc: delete timers synchronously in llc_sk_free()
	tcp: don't read out-of-bounds opsize
	team: avoid adding twice the same option to the event list
	team: fix netconsole setup over team
	packet: fix bitfield update race
	pppoe: check sockaddr length in pppoe_connect()
	vlan: Fix reading memory beyond skb->tail in skb_vlan_tagged_multi
	sctp: do not check port in sctp_inet6_cmp_addr
	llc: hold llc_sap before release_sock()
	llc: fix NULL pointer deref for SOCK_ZAPPED
	tipc: add policy for TIPC_NLA_NET_ADDR
	net: fix deadlock while clearing neighbor proxy table
	tcp: md5: reject TCP_MD5SIG or TCP_MD5SIG_EXT on established sockets
	net: af_packet: fix race in PACKET_{R|T}X_RING
	ipv6: add RTA_TABLE and RTA_PREFSRC to rtm_ipv6_policy
	scsi: mptsas: Disable WRITE SAME
	cdrom: information leak in cdrom_ioctl_media_changed()
	s390/cio: update chpid descriptor after resource accessibility event
	s390/uprobes: implement arch_uretprobe_is_alive()
	Linux 4.4.130

Change-Id: I58646180c70ac61da3e2a602085760881d914eb5
Signed-off-by: Greg Kroah-Hartman <gregkh@google.com>
2018-04-30 05:54:58 -07:00
Jann Horn
09a37b3661 tcp: don't read out-of-bounds opsize
[ Upstream commit 7e5a206ab686f098367b61aca989f5cdfa8114a3 ]

The old code reads the "opsize" variable from out-of-bounds memory (first
byte behind the segment) if a broken TCP segment ends directly after an
opcode that is neither EOL nor NOP.

The result of the read isn't used for anything, so the worst thing that
could theoretically happen is a pagefault; and since the physmap is usually
mostly contiguous, even that seems pretty unlikely.

The following C reproducer triggers the uninitialized read - however, you
can't actually see anything happen unless you put something like a
pr_warn() in tcp_parse_md5sig_option() to print the opsize.

====================================
#define _GNU_SOURCE
#include <arpa/inet.h>
#include <stdlib.h>
#include <errno.h>
#include <stdarg.h>
#include <net/if.h>
#include <linux/if.h>
#include <linux/ip.h>
#include <linux/tcp.h>
#include <linux/in.h>
#include <linux/if_tun.h>
#include <err.h>
#include <sys/types.h>
#include <sys/stat.h>
#include <fcntl.h>
#include <string.h>
#include <stdio.h>
#include <unistd.h>
#include <sys/ioctl.h>
#include <assert.h>

void systemf(const char *command, ...) {
  char *full_command;
  va_list ap;
  va_start(ap, command);
  if (vasprintf(&full_command, command, ap) == -1)
    err(1, "vasprintf");
  va_end(ap);
  printf("systemf: <<<%s>>>\n", full_command);
  system(full_command);
}

char *devname;

int tun_alloc(char *name) {
  int fd = open("/dev/net/tun", O_RDWR);
  if (fd == -1)
    err(1, "open tun dev");
  static struct ifreq req = { .ifr_flags = IFF_TUN|IFF_NO_PI };
  strcpy(req.ifr_name, name);
  if (ioctl(fd, TUNSETIFF, &req))
    err(1, "TUNSETIFF");
  devname = req.ifr_name;
  printf("device name: %s\n", devname);
  return fd;
}

#define IPADDR(a,b,c,d) (((a)<<0)+((b)<<8)+((c)<<16)+((d)<<24))

void sum_accumulate(unsigned int *sum, void *data, int len) {
  assert((len&2)==0);
  for (int i=0; i<len/2; i++) {
    *sum += ntohs(((unsigned short *)data)[i]);
  }
}

unsigned short sum_final(unsigned int sum) {
  sum = (sum >> 16) + (sum & 0xffff);
  sum = (sum >> 16) + (sum & 0xffff);
  return htons(~sum);
}

void fix_ip_sum(struct iphdr *ip) {
  unsigned int sum = 0;
  sum_accumulate(&sum, ip, sizeof(*ip));
  ip->check = sum_final(sum);
}

void fix_tcp_sum(struct iphdr *ip, struct tcphdr *tcp) {
  unsigned int sum = 0;
  struct {
    unsigned int saddr;
    unsigned int daddr;
    unsigned char pad;
    unsigned char proto_num;
    unsigned short tcp_len;
  } fakehdr = {
    .saddr = ip->saddr,
    .daddr = ip->daddr,
    .proto_num = ip->protocol,
    .tcp_len = htons(ntohs(ip->tot_len) - ip->ihl*4)
  };
  sum_accumulate(&sum, &fakehdr, sizeof(fakehdr));
  sum_accumulate(&sum, tcp, tcp->doff*4);
  tcp->check = sum_final(sum);
}

int main(void) {
  int tun_fd = tun_alloc("inject_dev%d");
  systemf("ip link set %s up", devname);
  systemf("ip addr add 192.168.42.1/24 dev %s", devname);

  struct {
    struct iphdr ip;
    struct tcphdr tcp;
    unsigned char tcp_opts[20];
  } __attribute__((packed)) syn_packet = {
    .ip = {
      .ihl = sizeof(struct iphdr)/4,
      .version = 4,
      .tot_len = htons(sizeof(syn_packet)),
      .ttl = 30,
      .protocol = IPPROTO_TCP,
      /* FIXUP check */
      .saddr = IPADDR(192,168,42,2),
      .daddr = IPADDR(192,168,42,1)
    },
    .tcp = {
      .source = htons(1),
      .dest = htons(1337),
      .seq = 0x12345678,
      .doff = (sizeof(syn_packet.tcp)+sizeof(syn_packet.tcp_opts))/4,
      .syn = 1,
      .window = htons(64),
      .check = 0 /*FIXUP*/
    },
    .tcp_opts = {
      /* INVALID: trailing MD5SIG opcode after NOPs */
      1, 1, 1, 1, 1,
      1, 1, 1, 1, 1,
      1, 1, 1, 1, 1,
      1, 1, 1, 1, 19
    }
  };
  fix_ip_sum(&syn_packet.ip);
  fix_tcp_sum(&syn_packet.ip, &syn_packet.tcp);
  while (1) {
    int write_res = write(tun_fd, &syn_packet, sizeof(syn_packet));
    if (write_res != sizeof(syn_packet))
      err(1, "packet write failed");
  }
}
====================================

Fixes: cfb6eeb4c8 ("[TCP]: MD5 Signature Option (RFC2385) support.")
Signed-off-by: Jann Horn <jannh@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2018-04-29 07:50:05 +02:00
Greg Kroah-Hartman
89904ccfe2 This is the 4.4.128 stable release
-----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCAAdFiEEZH8oZUiU471FcZm+ONu9yGCSaT4FAlrQ7ekACgkQONu9yGCS
 aT6Znw//VtAP82BGP/+H6X6gt0rBRIYseEJkHOpKRu5PK+Vpx7mMQFIfBId95P6R
 buq1QyzY9yz8ixbByg/w60WA2jK/I9i0tDGBnSlZzNmUvbk01oBN+cc/weZDynF7
 rFbSvD3aTmPB4nm9VE+n7V/tgGeuu/bwi04zulAm/B0/zA+w9GZv/aAto3WlLdjF
 ogZPSo5y6ifm6Qryq9sTR42LyDBXOy1klRSIK5EXY1OnIvPL1HSYR3ea2yj3AMXB
 RPvpCCY8j7zC9yVifX1c+Gfv2tXVHb9kjgheJixP2J4M3fFlR5tjLQXtTP2S2I8G
 cuMcdT6MiQw31rMoLcpej66dMtkL3k6sEpzcnSPPNenTuDIolz7BLEyaO/hhgi9J
 6vIXAd4Xm9D8HkH3iG/L3GtD3JXpVPtHyli/X1M3hz/VNUSOUPENIjMmGoxfBOtQ
 d7c8VGxDjnqmafri3fBAm4c603qW7O1wqJ7vLs9z7vgOIxlOLoJ/uiazoJKgW6O0
 z0S/BABWqpAUAI9jgm2GvRDR2keM2mhQIgIrY0+ZpnaLSGe3MugB+GbK6xdBCuYA
 anOv9VTEAPlTc8gb+GlusbUVjQyacEDwXoT6f9mELCW8cqpMgh+3TiKFihbYkUTN
 ly/DxZH3jpva0dq94Mgjv1u/nlg9ac3zqGeo9buQQFC7MSoZKEM=
 =LiZa
 -----END PGP SIGNATURE-----

Merge 4.4.128 into android-4.4

Changes in 4.4.128
	cfg80211: make RATE_INFO_BW_20 the default
	md/raid5: make use of spin_lock_irq over local_irq_disable + spin_lock
	rtc: snvs: fix an incorrect check of return value
	x86/asm: Don't use RBP as a temporary register in csum_partial_copy_generic()
	NFSv4.1: RECLAIM_COMPLETE must handle NFS4ERR_CONN_NOT_BOUND_TO_SESSION
	IB/srpt: Fix abort handling
	af_key: Fix slab-out-of-bounds in pfkey_compile_policy.
	mac80211: bail out from prep_connection() if a reconfig is ongoing
	bna: Avoid reading past end of buffer
	qlge: Avoid reading past end of buffer
	ipmi_ssif: unlock on allocation failure
	net: cdc_ncm: Fix TX zero padding
	net: ethernet: ti: cpsw: adjust cpsw fifos depth for fullduplex flow control
	lockd: fix lockd shutdown race
	drivers/misc/vmw_vmci/vmci_queue_pair.c: fix a couple integer overflow tests
	pidns: disable pid allocation if pid_ns_prepare_proc() is failed in alloc_pid()
	s390: move _text symbol to address higher than zero
	net/mlx4_en: Avoid adding steering rules with invalid ring
	NFSv4.1: Work around a Linux server bug...
	CIFS: silence lockdep splat in cifs_relock_file()
	blk-mq: NVMe 512B/4K+T10 DIF/DIX format returns I/O error on dd with split op
	net: qca_spi: Fix alignment issues in rx path
	netxen_nic: set rcode to the return status from the call to netxen_issue_cmd
	Input: elan_i2c - check if device is there before really probing
	Input: elantech - force relative mode on a certain module
	KVM: PPC: Book3S PR: Check copy_to/from_user return values
	vmxnet3: ensure that adapter is in proper state during force_close
	SMB2: Fix share type handling
	bus: brcmstb_gisb: Use register offsets with writes too
	bus: brcmstb_gisb: correct support for 64-bit address output
	PowerCap: Fix an error code in powercap_register_zone()
	ARM: dts: imx53-qsrb: Pulldown PMIC IRQ pin
	staging: wlan-ng: prism2mgmt.c: fixed a double endian conversion before calling hfa384x_drvr_setconfig16, also fixes relative sparse warning
	x86/tsc: Provide 'tsc=unstable' boot parameter
	ARM: dts: imx6qdl-wandboard: Fix audio channel swap
	ipv6: avoid dad-failures for addresses with NODAD
	async_tx: Fix DMA_PREP_FENCE usage in do_async_gen_syndrome()
	usb: dwc3: keystone: check return value
	btrfs: fix incorrect error return ret being passed to mapping_set_error
	ata: libahci: properly propagate return value of platform_get_irq()
	neighbour: update neigh timestamps iff update is effective
	arp: honour gratuitous ARP _replies_
	usb: chipidea: properly handle host or gadget initialization failure
	USB: ene_usb6250: fix first command execution
	net: x25: fix one potential use-after-free issue
	USB: ene_usb6250: fix SCSI residue overwriting
	serial: 8250: omap: Disable DMA for console UART
	serial: sh-sci: Fix race condition causing garbage during shutdown
	sh_eth: Use platform device for printing before register_netdev()
	scsi: csiostor: fix use after free in csio_hw_use_fwconfig()
	powerpc/mm: Fix virt_addr_valid() etc. on 64-bit hash
	ath5k: fix memory leak on buf on failed eeprom read
	selftests/powerpc: Fix TM resched DSCR test with some compilers
	xfrm: fix state migration copy replay sequence numbers
	iio: hi8435: avoid garbage event at first enable
	iio: hi8435: cleanup reset gpio
	ext4: handle the rest of ext4_mb_load_buddy() ENOMEM errors
	md-cluster: fix potential lock issue in add_new_disk
	ARM: davinci: da8xx: Create DSP device only when assigned memory
	ray_cs: Avoid reading past end of buffer
	leds: pca955x: Correct I2C Functionality
	sched/numa: Use down_read_trylock() for the mmap_sem
	net/mlx5: Tolerate irq_set_affinity_hint() failures
	selinux: do not check open permission on sockets
	block: fix an error code in add_partition()
	mlx5: fix bug reading rss_hash_type from CQE
	net: ieee802154: fix net_device reference release too early
	libceph: NULL deref on crush_decode() error path
	netfilter: ctnetlink: fix incorrect nf_ct_put during hash resize
	pNFS/flexfiles: missing error code in ff_layout_alloc_lseg()
	ASoC: rsnd: SSI PIO adjust to 24bit mode
	scsi: bnx2fc: fix race condition in bnx2fc_get_host_stats()
	fix race in drivers/char/random.c:get_reg()
	ext4: fix off-by-one on max nr_pages in ext4_find_unwritten_pgoff()
	tcp: better validation of received ack sequences
	net: move somaxconn init from sysctl code
	Input: elan_i2c - clear INT before resetting controller
	bonding: Don't update slave->link until ready to commit
	KVM: nVMX: Fix handling of lmsw instruction
	net: llc: add lock_sock in llc_ui_bind to avoid a race condition
	ARM: dts: ls1021a: add "fsl,ls1021a-esdhc" compatible string to esdhc node
	thermal: power_allocator: fix one race condition issue for thermal_instances list
	perf probe: Add warning message if there is unexpected event name
	l2tp: fix missing print session offset info
	rds; Reset rs->rs_bound_addr in rds_add_bound() failure path
	hwmon: (ina2xx) Make calibration register value fixed
	media: videobuf2-core: don't go out of the buffer range
	ASoC: Intel: cht_bsw_rt5645: Analog Mic support
	scsi: libiscsi: Allow sd_shutdown on bad transport
	scsi: mpt3sas: Proper handling of set/clear of "ATA command pending" flag.
	vfb: fix video mode and line_length being set when loaded
	gpio: label descriptors using the device name
	ASoC: Intel: sst: Fix the return value of 'sst_send_byte_stream_mrfld()'
	wl1251: check return from call to wl1251_acx_arp_ip_filter
	hdlcdrv: Fix divide by zero in hdlcdrv_ioctl
	ovl: filter trusted xattr for non-admin
	powerpc/[booke|4xx]: Don't clobber TCR[WP] when setting TCR[DIE]
	dmaengine: imx-sdma: Handle return value of clk_prepare_enable
	arm64: futex: Fix undefined behaviour with FUTEX_OP_OPARG_SHIFT usage
	net/mlx5: avoid build warning for uniprocessor
	cxgb4: FW upgrade fixes
	rtc: opal: Handle disabled TPO in opal_get_tpo_time()
	rtc: interface: Validate alarm-time before handling rollover
	SUNRPC: ensure correct error is reported by xs_tcp_setup_socket()
	net: freescale: fix potential null pointer dereference
	KVM: SVM: do not zero out segment attributes if segment is unusable or not present
	clk: scpi: fix return type of __scpi_dvfs_round_rate
	clk: Fix __set_clk_rates error print-string
	powerpc/spufs: Fix coredump of SPU contexts
	perf trace: Add mmap alias for s390
	qlcnic: Fix a sleep-in-atomic bug in qlcnic_82xx_hw_write_wx_2M and qlcnic_82xx_hw_read_wx_2M
	mISDN: Fix a sleep-in-atomic bug
	drm/omap: fix tiled buffer stride calculations
	cxgb4: fix incorrect cim_la output for T6
	Fix serial console on SNI RM400 machines
	bio-integrity: Do not allocate integrity context for bio w/o data
	skbuff: return -EMSGSIZE in skb_to_sgvec to prevent overflow
	sit: reload iphdr in ipip6_rcv
	net/mlx4: Fix the check in attaching steering rules
	net/mlx4: Check if Granular QoS per VF has been enabled before updating QP qos_vport
	perf header: Set proper module name when build-id event found
	perf report: Ensure the perf DSO mapping matches what libdw sees
	tags: honor COMPILED_SOURCE with apart output directory
	e1000e: fix race condition around skb_tstamp_tx()
	cx25840: fix unchecked return values
	mceusb: sporadic RX truncation corruption fix
	net: phy: avoid genphy_aneg_done() for PHYs without clause 22 support
	ARM: imx: Add MXC_CPU_IMX6ULL and cpu_is_imx6ull
	e1000e: Undo e1000e_pm_freeze if __e1000_shutdown fails
	perf/core: Correct event creation with PERF_FORMAT_GROUP
	MIPS: mm: fixed mappings: correct initialisation
	MIPS: mm: adjust PKMAP location
	MIPS: kprobes: flush_insn_slot should flush only if probe initialised
	Fix loop device flush before configure v3
	net: emac: fix reset timeout with AR8035 phy
	perf tests: Decompress kernel module before objdump
	skbuff: only inherit relevant tx_flags
	xen: avoid type warning in xchg_xen_ulong
	bnx2x: Allow vfs to disable txvlan offload
	sctp: fix recursive locking warning in sctp_do_peeloff
	sparc64: ldc abort during vds iso boot
	iio: magnetometer: st_magn_spi: fix spi_device_id table
	Bluetooth: Send HCI Set Event Mask Page 2 command only when needed
	cpuidle: dt: Add missing 'of_node_put()'
	ACPICA: Events: Add runtime stub support for event APIs
	ACPICA: Disassembler: Abort on an invalid/unknown AML opcode
	s390/dasd: fix hanging safe offline
	vxlan: dont migrate permanent fdb entries during learn
	bcache: stop writeback thread after detaching
	bcache: segregate flash only volume write streams
	scsi: libsas: fix memory leak in sas_smp_get_phy_events()
	scsi: libsas: fix error when getting phy events
	scsi: libsas: initialize sas_phy status according to response of DISCOVER
	blk-mq: fix kernel oops in blk_mq_tag_idle()
	tty: n_gsm: Allow ADM response in addition to UA for control dlci
	EDAC, mv64x60: Fix an error handling path
	cxgb4vf: Fix SGE FL buffer initialization logic for 64K pages
	perf tools: Fix copyfile_offset update of output offset
	ipsec: check return value of skb_to_sgvec always
	rxrpc: check return value of skb_to_sgvec always
	virtio_net: check return value of skb_to_sgvec always
	virtio_net: check return value of skb_to_sgvec in one more location
	random: use lockless method of accessing and updating f->reg_idx
	futex: Remove requirement for lock_page() in get_futex_key()
	Kbuild: provide a __UNIQUE_ID for clang
	arp: fix arp_filter on l3slave devices
	net: fix possible out-of-bound read in skb_network_protocol()
	net/ipv6: Fix route leaking between VRFs
	netlink: make sure nladdr has correct size in netlink_connect()
	net/sched: fix NULL dereference in the error path of tcf_bpf_init()
	pptp: remove a buggy dst release in pptp_connect()
	sctp: do not leak kernel memory to user space
	sctp: sctp_sockaddr_af must check minimal addr length for AF_INET6
	sky2: Increase D3 delay to sky2 stops working after suspend
	vhost: correctly remove wait queue during poll failure
	vlan: also check phy_driver ts_info for vlan's real device
	bonding: fix the err path for dev hwaddr sync in bond_enslave
	bonding: move dev_mc_sync after master_upper_dev_link in bond_enslave
	bonding: process the err returned by dev_set_allmulti properly in bond_enslave
	net: fool proof dev_valid_name()
	ip_tunnel: better validate user provided tunnel names
	ipv6: sit: better validate user provided tunnel names
	ip6_gre: better validate user provided tunnel names
	ip6_tunnel: better validate user provided tunnel names
	vti6: better validate user provided tunnel names
	r8169: fix setting driver_data after register_netdev
	net sched actions: fix dumping which requires several messages to user space
	net/ipv6: Increment OUTxxx counters after netfilter hook
	ipv6: the entire IPv6 header chain must fit the first fragment
	vrf: Fix use after free and double free in vrf_finish_output
	Revert "xhci: plat: Register shutdown for xhci_plat"
	Linux 4.4.128

Change-Id: I9c1e58f634cc18f15a840c9d192c892dfcc5ff73
Signed-off-by: Greg Kroah-Hartman <gregkh@google.com>
2018-04-14 15:35:32 +02:00
Eric Dumazet
87d96d1ba2 tcp: better validation of received ack sequences
[ Upstream commit d0e1a1b5a833b625c93d3d49847609350ebd79db ]

Paul Fiterau Brostean reported :

<quote>
Linux TCP stack we analyze exhibits behavior that seems odd to me.
The scenario is as follows (all packets have empty payloads, no window
scaling, rcv/snd window size should not be a factor):

       TEST HARNESS (CLIENT)                        LINUX SERVER

   1.  -                                          LISTEN (server listen,
then accepts)

   2.  - --> <SEQ=100><CTL=SYN>               --> SYN-RECEIVED

   3.  - <-- <SEQ=300><ACK=101><CTL=SYN,ACK>  <-- SYN-RECEIVED

   4.  - --> <SEQ=101><ACK=301><CTL=ACK>      --> ESTABLISHED

   5.  - <-- <SEQ=301><ACK=101><CTL=FIN,ACK>  <-- FIN WAIT-1 (server
opts to close the data connection calling "close" on the connection
socket)

   6.  - --> <SEQ=101><ACK=99999><CTL=FIN,ACK> --> CLOSING (client sends
FIN,ACK with not yet sent acknowledgement number)

   7.  - <-- <SEQ=302><ACK=102><CTL=ACK>      <-- CLOSING (ACK is 102
instead of 101, why?)

... (silence from CLIENT)

   8.  - <-- <SEQ=301><ACK=102><CTL=FIN,ACK>  <-- CLOSING
(retransmission, again ACK is 102)

Now, note that packet 6 while having the expected sequence number,
acknowledges something that wasn't sent by the server. So I would
expect
the packet to maybe prompt an ACK response from the server, and then be
ignored. Yet it is not ignored and actually leads to an increase of the
acknowledgement number in the server's retransmission of the FIN,ACK
packet. The explanation I found is that the FIN  in packet 6 was
processed, despite the acknowledgement number being unacceptable.
Further experiments indeed show that the server processes this FIN,
transitioning to CLOSING, then on receiving an ACK for the FIN it had
send in packet 5, the server (or better said connection) transitions
from CLOSING to TIME_WAIT (as signaled by netstat).

</quote>

Indeed, tcp_rcv_state_process() calls tcp_ack() but
does not exploit the @acceptable status but for TCP_SYN_RECV
state.

What we want here is to send a challenge ACK, if not in TCP_SYN_RECV
state. TCP_FIN_WAIT1 state is not the only state we should fix.

Add a FLAG_NO_CHALLENGE_ACK so that tcp_rcv_state_process()
can choose to send a challenge ACK and discard the packet instead
of wrongly change socket state.

With help from Neal Cardwell.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Reported-by: Paul Fiterau Brostean <p.fiterau-brostean@science.ru.nl>
Cc: Neal Cardwell <ncardwell@google.com>
Cc: Yuchung Cheng <ycheng@google.com>
Cc: Soheil Hassas Yeganeh <soheil@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Sasha Levin <alexander.levin@microsoft.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2018-04-13 19:50:11 +02:00
Greg Kroah-Hartman
851fb4da32 This is the 4.4.124 stable release
-----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCAAdFiEEZH8oZUiU471FcZm+ONu9yGCSaT4FAlq2IVkACgkQONu9yGCS
 aT5SGQ//eDH59qGQJFy8GwDQULnYV8JEeBZT2tIzx2LDVJvKKz/8BkJmcGZrQ0gH
 9EnMCiPkbxRH6HV6Cu1SHcyLFK+iVXGEw28Uk/sEcesQSZvrRxUKIgRLvyWhANgP
 E1FiPbI6V1tXTROmUk/lrGZYgHMVUMAa+kYRndpbijtB++7qO25mVngP0Yg6OveO
 oyw0MRx+v9mwQS/SID2EtlXnZOVGM+7kd3gg5fLY5w0qJT1jHjhrtZNWyKH2SMrS
 QnzkFDMGDIWk5TrHJY0LEvpZI/toKNtPrG/Mt5gOvtSgjmQ4+EEVndrlutPAOa3K
 xF6ERjtyBIRRG2ftk122fm6brjpxyCoqycD8JgCm1BxalNN7Bg+1ogQCGwE0EWNp
 yYEsxLmSbLllX/4rkZtx+WsgiG4rwNWG1/IsPsEo80La3M53WzA3My8E0aVLpbIj
 aqkzR62lZ1TSRgtbFjm6Bl4DtpJH4f9uELbO0VBi7b8LRTUI99Qcz4bOoKH+WKOC
 umvHsuyMwDm8wc0KhQ4hdyGb1le0nS4Y1xydp1p0pcASLfeA2VOIg20ZvxQVBTZA
 rlHo7zEQVkVYvphoPONUodOUVZ8/AGxoVvim3HlI6s5VvtdF6k/hQUWaOuWDh59J
 bjw2vnMWOLWSqmmkpK9HmU6ohIV310TJewPu8BShzAuZRSNDxl8=
 =OCjx
 -----END PGP SIGNATURE-----

Merge 4.4.124 into android-4.4

Changes in 4.4.124
	tpm: fix potential buffer overruns caused by bit glitches on the bus
	tpm_tis: fix potential buffer overruns caused by bit glitches on the bus
	SMB3: Validate negotiate request must always be signed
	CIFS: Enable encryption during session setup phase
	staging: android: ashmem: Fix possible deadlock in ashmem_ioctl
	platform/x86: asus-nb-wmi: Add wapf4 quirk for the X302UA
	regulator: anatop: set default voltage selector for pcie
	x86: i8259: export legacy_pic symbol
	rtc: cmos: Do not assume irq 8 for rtc when there are no legacy irqs
	Input: ar1021_i2c - fix too long name in driver's device table
	time: Change posix clocks ops interfaces to use timespec64
	ACPI/processor: Fix error handling in __acpi_processor_start()
	ACPI/processor: Replace racy task affinity logic
	cpufreq/sh: Replace racy task affinity logic
	genirq: Use irqd_get_trigger_type to compare the trigger type for shared IRQs
	i2c: i2c-scmi: add a MS HID
	net: ipv6: send unsolicited NA on admin up
	media/dvb-core: Race condition when writing to CAM
	spi: dw: Disable clock after unregistering the host
	ath: Fix updating radar flags for coutry code India
	clk: ns2: Correct SDIO bits
	scsi: virtio_scsi: Always try to read VPD pages
	KVM: PPC: Book3S PR: Exit KVM on failed mapping
	ARM: 8668/1: ftrace: Fix dynamic ftrace with DEBUG_RODATA and !FRAME_POINTER
	iommu/omap: Register driver before setting IOMMU ops
	md/raid10: wait up frozen array in handle_write_completed
	NFS: Fix missing pg_cleanup after nfs_pageio_cond_complete()
	tcp: remove poll() flakes with FastOpen
	e1000e: fix timing for 82579 Gigabit Ethernet controller
	ALSA: hda - Fix headset microphone detection for ASUS N551 and N751
	IB/ipoib: Fix deadlock between ipoib_stop and mcast join flow
	IB/ipoib: Update broadcast object if PKey value was changed in index 0
	HSI: ssi_protocol: double free in ssip_pn_xmit()
	IB/mlx4: Take write semaphore when changing the vma struct
	IB/mlx4: Change vma from shared to private
	ASoC: Intel: Skylake: Uninitialized variable in probe_codec()
	Fix driver usage of 128B WQEs when WQ_CREATE is V1.
	netfilter: xt_CT: fix refcnt leak on error path
	openvswitch: Delete conntrack entry clashing with an expectation.
	mmc: host: omap_hsmmc: checking for NULL instead of IS_ERR()
	wan: pc300too: abort path on failure
	qlcnic: fix unchecked return value
	scsi: mac_esp: Replace bogus memory barrier with spinlock
	infiniband/uverbs: Fix integer overflows
	NFS: don't try to cross a mountpount when there isn't one there.
	iio: st_pressure: st_accel: Initialise sensor platform data properly
	mt7601u: check return value of alloc_skb
	rndis_wlan: add return value validation
	Btrfs: send, fix file hole not being preserved due to inline extent
	mac80211: don't parse encrypted management frames in ieee80211_frame_acked
	mfd: palmas: Reset the POWERHOLD mux during power off
	mtip32xx: use runtime tag to initialize command header
	staging: unisys: visorhba: fix s-Par to boot with option CONFIG_VMAP_STACK set to y
	staging: wilc1000: fix unchecked return value
	mmc: sdhci-of-esdhc: limit SD clock for ls1012a/ls1046a
	ARM: DRA7: clockdomain: Change the CLKTRCTRL of CM_PCIE_CLKSTCTRL to SW_WKUP
	ipmi/watchdog: fix wdog hang on panic waiting for ipmi response
	ACPI / PMIC: xpower: Fix power_table addresses
	drm/nouveau/kms: Increase max retries in scanout position queries.
	bnx2x: Align RX buffers
	power: supply: pda_power: move from timer to delayed_work
	Input: twl4030-pwrbutton - use correct device for irq request
	md/raid10: skip spare disk as 'first' disk
	ia64: fix module loading for gcc-5.4
	tcm_fileio: Prevent information leak for short reads
	video: fbdev: udlfb: Fix buffer on stack
	sm501fb: don't return zero on failure path in sm501fb_start()
	net: hns: fix ethtool_get_strings overflow in hns driver
	cifs: small underflow in cnvrtDosUnixTm()
	rtc: ds1374: wdt: Fix issue with timeout scaling from secs to wdt ticks
	rtc: ds1374: wdt: Fix stop/start ioctl always returning -EINVAL
	perf tests kmod-path: Don't fail if compressed modules aren't supported
	Bluetooth: hci_qca: Avoid setup failure on missing rampatch
	media: c8sectpfe: fix potential NULL pointer dereference in c8sectpfe_timer_interrupt
	drm/msm: fix leak in failed get_pages
	RDMA/iwpm: Fix uninitialized error code in iwpm_send_mapinfo()
	rtlwifi: rtl_pci: Fix the bug when inactiveps is enabled.
	media: bt8xx: Fix err 'bt878_probe()'
	media: [RESEND] media: dvb-frontends: Add delay to Si2168 restart
	cros_ec: fix nul-termination for firmware build info
	platform/chrome: Use proper protocol transfer function
	mmc: avoid removing non-removable hosts during suspend
	IB/ipoib: Avoid memory leak if the SA returns a different DGID
	RDMA/cma: Use correct size when writing netlink stats
	IB/umem: Fix use of npages/nmap fields
	vgacon: Set VGA struct resource types
	drm/omap: DMM: Check for DMM readiness after successful transaction commit
	pty: cancel pty slave port buf's work in tty_release
	coresight: Fix disabling of CoreSight TPIU
	pinctrl: Really force states during suspend/resume
	iommu/vt-d: clean up pr_irq if request_threaded_irq fails
	ip6_vti: adjust vti mtu according to mtu of lower device
	RDMA/ocrdma: Fix permissions for OCRDMA_RESET_STATS
	nfsd4: permit layoutget of executable-only files
	clk: si5351: Rename internal plls to avoid name collisions
	dmaengine: ti-dma-crossbar: Fix event mapping for TPCC_EVT_MUX_60_63
	RDMA/ucma: Fix access to non-initialized CM_ID object
	Linux 4.4.124

Change-Id: Iac6f5bda7941f032c5b1f58750e084140b0e3f23
Signed-off-by: Greg Kroah-Hartman <gregkh@google.com>
2018-03-25 10:51:55 +02:00
Eric Dumazet
2349cbd511 tcp: remove poll() flakes with FastOpen
[ Upstream commit 0f9fa831aecfc297b7b45d4f046759bcefcf87f0 ]

When using TCP FastOpen for an active session, we send one wakeup event
from tcp_finish_connect(), right before the data eventually contained in
the received SYNACK is queued to sk->sk_receive_queue.

This means that depending on machine load or luck, poll() users
might receive POLLOUT events instead of POLLIN|POLLOUT

To fix this, we need to move the call to sk->sk_state_change()
after the (optional) call to tcp_rcv_fastopen_synack()

Signed-off-by: Eric Dumazet <edumazet@google.com>
Acked-by: Yuchung Cheng <ycheng@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Sasha Levin <alexander.levin@microsoft.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2018-03-24 10:58:42 +01:00
Greg Kroah-Hartman
8a5396242e This is the 4.4.105 stable release
-----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCAAdFiEEZH8oZUiU471FcZm+ONu9yGCSaT4FAlosIJUACgkQONu9yGCS
 aT5/DQ/9Hkroo15HPErzBhLUIo3tdT8HyMBSA5gjy3x+fYBBkbj1HvyPWlwbZVgQ
 q2xQs7+Au8UlLLOV2UDz5C3K+hU4tM+uS3TC781ats5+VYWXAg6lXMshqwuXCDCX
 eZ1joC1cs+tHQ1AinXV341c6TX9HOQ1/kwZRxvS9q/n8PQd71uzIXxsp/AkqloMQ
 zTYz4FKdFPmKrHmMHR2X4MbRX85S5S57IJwmsfFEB+89PhfHGT5UkuClUuVrxV6m
 a34BcDDG2SZ9V58OEDS99SJIvVthyisv7iDwoe6qBO0cITI/DiZ15T53gR+6nhMd
 L+Y78wWohLFXxNoQFVXax/viFlsSju9j3f9NZeBb3olFlGuSiITsoJethU33Tzg0
 Lf1X9ELMekBDyE/e2nYN+wILz6WioWxHrJ9nqAkwxOAL8YQTjfiUnjv/TUO/AJZ3
 dmoJyrHOjkUTJpSQXx60ZwyvIyBv/oy8CWbr0lzVoq23OBrP26o/1C1Sbrm6bGMG
 6fUccAqFEG9IW7fxUM0ojFRGL/+eSkHIMeEQopfNs4XXIBV53kUD8en+Hf4n5sic
 NskK/UZSrC8KCAG4wmv1T+alDSrjhn+7xMdzRieEthhjFouwH3i8OROaVS+YX8+y
 oDZpCZxBMhMCFKl+CWbqj8+DdNJQgs5+7R6YhEjEmxpiPfIPQhQ=
 =XPGm
 -----END PGP SIGNATURE-----

Merge 4.4.105 into android-4.4

Changes in 4.4.105
	bcache: only permit to recovery read error when cache device is clean
	bcache: recover data from backing when data is clean
	uas: Always apply US_FL_NO_ATA_1X quirk to Seagate devices
	usb: quirks: Add no-lpm quirk for KY-688 USB 3.1 Type-C Hub
	serial: 8250_pci: Add Amazon PCI serial device ID
	s390/runtime instrumentation: simplify task exit handling
	USB: serial: option: add Quectel BG96 id
	ima: fix hash algorithm initialization
	s390/pci: do not require AIS facility
	selftests/x86/ldt_get: Add a few additional tests for limits
	serial: 8250_fintek: Fix rs485 disablement on invalid ioctl()
	spi: sh-msiof: Fix DMA transfer size check
	usb: phy: tahvo: fix error handling in tahvo_usb_probe()
	serial: 8250: Preserve DLD[7:4] for PORT_XR17V35X
	x86/entry: Use SYSCALL_DEFINE() macros for sys_modify_ldt()
	EDAC, sb_edac: Fix missing break in switch
	sysrq : fix Show Regs call trace on ARM
	perf test attr: Fix ignored test case result
	kprobes/x86: Disable preemption in ftrace-based jprobes
	net: systemport: Utilize skb_put_padto()
	net: systemport: Pad packet before inserting TSB
	ARM: OMAP1: DMA: Correct the number of logical channels
	vti6: fix device register to report IFLA_INFO_KIND
	net/appletalk: Fix kernel memory disclosure
	ravb: Remove Rx overflow log messages
	nfs: Don't take a reference on fl->fl_file for LOCK operation
	KVM: arm/arm64: Fix occasional warning from the timer work function
	NFSv4: Fix client recovery when server reboots multiple times
	drm/exynos/decon5433: set STANDALONE_UPDATE_F on output enablement
	net: sctp: fix array overrun read on sctp_timer_tbl
	tipc: fix cleanup at module unload
	dmaengine: pl330: fix double lock
	tcp: correct memory barrier usage in tcp_check_space()
	mm: avoid returning VM_FAULT_RETRY from ->page_mkwrite handlers
	xen-netfront: Improve error handling during initialization
	net: fec: fix multicast filtering hardware setup
	Revert "ocfs2: should wait dio before inode lock in ocfs2_setattr()"
	usb: hub: Cycle HUB power when initialization fails
	usb: xhci: fix panic in xhci_free_virt_devices_depth_first
	usb: Add USB 3.1 Precision time measurement capability descriptor support
	usb: ch9: Add size macro for SSP dev cap descriptor
	USB: core: Add type-specific length check of BOS descriptors
	USB: Increase usbfs transfer limit
	USB: devio: Prevent integer overflow in proc_do_submiturb()
	USB: usbfs: Filter flags passed in from user space
	usb: host: fix incorrect updating of offset
	xen-netfront: avoid crashing on resume after a failure in talk_to_netback()
	Linux 4.4.105

Signed-off-by: Greg Kroah-Hartman <gregkh@google.com>
2017-12-10 14:42:29 +01:00
Jason Baron
1b7dbabf02 tcp: correct memory barrier usage in tcp_check_space()
[ Upstream commit 56d806222ace4c3aeae516cd7a855340fb2839d8 ]

sock_reset_flag() maps to __clear_bit() not the atomic version clear_bit().
Thus, we need smp_mb(), smp_mb__after_atomic() is not sufficient.

Fixes: 3c7151275c ("tcp: add memory barriers to write space paths")
Cc: Eric Dumazet <eric.dumazet@gmail.com>
Cc: Oleg Nesterov <oleg@redhat.com>
Signed-off-by: Jason Baron <jbaron@akamai.com>
Acked-by: Eric Dumazet <edumazet@google.com>
Reported-by: Oleg Nesterov <oleg@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Sasha Levin <alexander.levin@verizon.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2017-12-09 18:42:43 +01:00
Greg Kroah-Hartman
7eab308a49 This is the 4.4.99 stable release
-----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCAAdFiEEZH8oZUiU471FcZm+ONu9yGCSaT4FAloQBz0ACgkQONu9yGCS
 aT5nQxAAs/xWKpYLSLvLPYnTOmSmNJ36isgFriVT+wWOLzkrWTWuoQnluDjMQjie
 nH6whZMlOnG+k5GrGF3XymxZ66tDj9TlXnPAHCC8ikcqir2/dBO/gO5v2gmFgF2E
 j52mt09I3acBQJEt+Rz3xJCMa5so61uDYGtqk/URcPEW1nBa1rfA1QIy/9zv2/aw
 2yPSz4NQlv+7yvjguw4Ik5Yt/hGeu1Y8Kuc4bVHG2TB+y0QYwri42bBwQV7llili
 XqwfjFJYGMqWJqHGF/p0hD+/Xylw6GnDzxDZQDMNhsuWcfe3tUhuOVkX30E96fh2
 ipT4wI5DTmql8EN/r/P7VS2BKL4W5HEMeNEd2APkGNGnSrzKGbd0CQ+cWVZbr645
 R03AbqZjhaQKwRi+n82q1mMb4p+3Z/F/T8twHmYg/DLta3kzzRdfVJPNFBFIoSnF
 Bay8KJKqoMv2Bjhla78pHMoqSQ9j/fJc2iPIAABtFlsTjic+/STiS7ANAsmDdJtt
 8XXc6mFQfbulKKlKKqudPLjOpUNu1SrsOcc9gmovbTy7dN6FBOfJwFMCYonNyXAc
 6/ACSxYJlnZ9YEacEmcXmz0GTytyKiTYE3fNsXc/8fHnRZ1+yea9Mo77wWkj7K4V
 IqNIJMCW8K+P97oL6mdUBZwMUi4zrWueakMq8SWBdKYaD5yeV2k=
 =ql4j
 -----END PGP SIGNATURE-----

Merge 4.4.99 into android-4.4

Changes in 4.4.99
	mac80211: accept key reinstall without changing anything
	mac80211: use constant time comparison with keys
	mac80211: don't compare TKIP TX MIC key in reinstall prevention
	usb: usbtest: fix NULL pointer dereference
	Input: ims-psu - check if CDC union descriptor is sane
	ALSA: seq: Cancel pending autoload work at unbinding device
	tun/tap: sanitize TUNSETSNDBUF input
	tcp: fix tcp_mtu_probe() vs highest_sack
	l2tp: check ps->sock before running pppol2tp_session_ioctl()
	tun: call dev_get_valid_name() before register_netdevice()
	sctp: add the missing sock_owned_by_user check in sctp_icmp_redirect
	packet: avoid panic in packet_getsockopt()
	ipv6: flowlabel: do not leave opt->tot_len with garbage
	net/unix: don't show information about sockets from other namespaces
	ip6_gre: only increase err_count for some certain type icmpv6 in ip6gre_err
	tun: allow positive return values on dev_get_valid_name() call
	sctp: reset owner sk for data chunks on out queues when migrating a sock
	ppp: fix race in ppp device destruction
	ipip: only increase err_count for some certain type icmp in ipip_err
	tcp/dccp: fix ireq->opt races
	tcp/dccp: fix lockdep splat in inet_csk_route_req()
	tcp/dccp: fix other lockdep splats accessing ireq_opt
	security/keys: add CONFIG_KEYS_COMPAT to Kconfig
	tipc: fix link attribute propagation bug
	brcmfmac: remove setting IBSS mode when stopping AP
	target/iscsi: Fix iSCSI task reassignment handling
	target: Fix node_acl demo-mode + uncached dynamic shutdown regression
	misc: panel: properly restore atomic counter on error path
	Linux 4.4.99

Signed-off-by: Greg Kroah-Hartman <gregkh@google.com>
2017-11-18 17:24:24 +01:00
Eric Dumazet
13eddc6756 tcp/dccp: fix ireq->opt races
[ Upstream commit c92e8c02fe664155ac4234516e32544bec0f113d ]

syzkaller found another bug in DCCP/TCP stacks [1]

For the reasons explained in commit ce1050089c ("tcp/dccp: fix
ireq->pktopts race"), we need to make sure we do not access
ireq->opt unless we own the request sock.

Note the opt field is renamed to ireq_opt to ease grep games.

[1]
BUG: KASAN: use-after-free in ip_queue_xmit+0x1687/0x18e0 net/ipv4/ip_output.c:474
Read of size 1 at addr ffff8801c951039c by task syz-executor5/3295

CPU: 1 PID: 3295 Comm: syz-executor5 Not tainted 4.14.0-rc4+ #80
Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011
Call Trace:
 __dump_stack lib/dump_stack.c:16 [inline]
 dump_stack+0x194/0x257 lib/dump_stack.c:52
 print_address_description+0x73/0x250 mm/kasan/report.c:252
 kasan_report_error mm/kasan/report.c:351 [inline]
 kasan_report+0x25b/0x340 mm/kasan/report.c:409
 __asan_report_load1_noabort+0x14/0x20 mm/kasan/report.c:427
 ip_queue_xmit+0x1687/0x18e0 net/ipv4/ip_output.c:474
 tcp_transmit_skb+0x1ab7/0x3840 net/ipv4/tcp_output.c:1135
 tcp_send_ack.part.37+0x3bb/0x650 net/ipv4/tcp_output.c:3587
 tcp_send_ack+0x49/0x60 net/ipv4/tcp_output.c:3557
 __tcp_ack_snd_check+0x2c6/0x4b0 net/ipv4/tcp_input.c:5072
 tcp_ack_snd_check net/ipv4/tcp_input.c:5085 [inline]
 tcp_rcv_state_process+0x2eff/0x4850 net/ipv4/tcp_input.c:6071
 tcp_child_process+0x342/0x990 net/ipv4/tcp_minisocks.c:816
 tcp_v4_rcv+0x1827/0x2f80 net/ipv4/tcp_ipv4.c:1682
 ip_local_deliver_finish+0x2e2/0xba0 net/ipv4/ip_input.c:216
 NF_HOOK include/linux/netfilter.h:249 [inline]
 ip_local_deliver+0x1ce/0x6e0 net/ipv4/ip_input.c:257
 dst_input include/net/dst.h:464 [inline]
 ip_rcv_finish+0x887/0x19a0 net/ipv4/ip_input.c:397
 NF_HOOK include/linux/netfilter.h:249 [inline]
 ip_rcv+0xc3f/0x1820 net/ipv4/ip_input.c:493
 __netif_receive_skb_core+0x1a3e/0x34b0 net/core/dev.c:4476
 __netif_receive_skb+0x2c/0x1b0 net/core/dev.c:4514
 netif_receive_skb_internal+0x10b/0x670 net/core/dev.c:4587
 netif_receive_skb+0xae/0x390 net/core/dev.c:4611
 tun_rx_batched.isra.50+0x5ed/0x860 drivers/net/tun.c:1372
 tun_get_user+0x249c/0x36d0 drivers/net/tun.c:1766
 tun_chr_write_iter+0xbf/0x160 drivers/net/tun.c:1792
 call_write_iter include/linux/fs.h:1770 [inline]
 new_sync_write fs/read_write.c:468 [inline]
 __vfs_write+0x68a/0x970 fs/read_write.c:481
 vfs_write+0x18f/0x510 fs/read_write.c:543
 SYSC_write fs/read_write.c:588 [inline]
 SyS_write+0xef/0x220 fs/read_write.c:580
 entry_SYSCALL_64_fastpath+0x1f/0xbe
RIP: 0033:0x40c341
RSP: 002b:00007f469523ec10 EFLAGS: 00000293 ORIG_RAX: 0000000000000001
RAX: ffffffffffffffda RBX: 0000000000718000 RCX: 000000000040c341
RDX: 0000000000000037 RSI: 0000000020004000 RDI: 0000000000000015
RBP: 0000000000000086 R08: 0000000000000000 R09: 0000000000000000
R10: 00000000000f4240 R11: 0000000000000293 R12: 00000000004b7fd1
R13: 00000000ffffffff R14: 0000000020000000 R15: 0000000000025000

Allocated by task 3295:
 save_stack_trace+0x16/0x20 arch/x86/kernel/stacktrace.c:59
 save_stack+0x43/0xd0 mm/kasan/kasan.c:447
 set_track mm/kasan/kasan.c:459 [inline]
 kasan_kmalloc+0xad/0xe0 mm/kasan/kasan.c:551
 __do_kmalloc mm/slab.c:3725 [inline]
 __kmalloc+0x162/0x760 mm/slab.c:3734
 kmalloc include/linux/slab.h:498 [inline]
 tcp_v4_save_options include/net/tcp.h:1962 [inline]
 tcp_v4_init_req+0x2d3/0x3e0 net/ipv4/tcp_ipv4.c:1271
 tcp_conn_request+0xf6d/0x3410 net/ipv4/tcp_input.c:6283
 tcp_v4_conn_request+0x157/0x210 net/ipv4/tcp_ipv4.c:1313
 tcp_rcv_state_process+0x8ea/0x4850 net/ipv4/tcp_input.c:5857
 tcp_v4_do_rcv+0x55c/0x7d0 net/ipv4/tcp_ipv4.c:1482
 tcp_v4_rcv+0x2d10/0x2f80 net/ipv4/tcp_ipv4.c:1711
 ip_local_deliver_finish+0x2e2/0xba0 net/ipv4/ip_input.c:216
 NF_HOOK include/linux/netfilter.h:249 [inline]
 ip_local_deliver+0x1ce/0x6e0 net/ipv4/ip_input.c:257
 dst_input include/net/dst.h:464 [inline]
 ip_rcv_finish+0x887/0x19a0 net/ipv4/ip_input.c:397
 NF_HOOK include/linux/netfilter.h:249 [inline]
 ip_rcv+0xc3f/0x1820 net/ipv4/ip_input.c:493
 __netif_receive_skb_core+0x1a3e/0x34b0 net/core/dev.c:4476
 __netif_receive_skb+0x2c/0x1b0 net/core/dev.c:4514
 netif_receive_skb_internal+0x10b/0x670 net/core/dev.c:4587
 netif_receive_skb+0xae/0x390 net/core/dev.c:4611
 tun_rx_batched.isra.50+0x5ed/0x860 drivers/net/tun.c:1372
 tun_get_user+0x249c/0x36d0 drivers/net/tun.c:1766
 tun_chr_write_iter+0xbf/0x160 drivers/net/tun.c:1792
 call_write_iter include/linux/fs.h:1770 [inline]
 new_sync_write fs/read_write.c:468 [inline]
 __vfs_write+0x68a/0x970 fs/read_write.c:481
 vfs_write+0x18f/0x510 fs/read_write.c:543
 SYSC_write fs/read_write.c:588 [inline]
 SyS_write+0xef/0x220 fs/read_write.c:580
 entry_SYSCALL_64_fastpath+0x1f/0xbe

Freed by task 3306:
 save_stack_trace+0x16/0x20 arch/x86/kernel/stacktrace.c:59
 save_stack+0x43/0xd0 mm/kasan/kasan.c:447
 set_track mm/kasan/kasan.c:459 [inline]
 kasan_slab_free+0x71/0xc0 mm/kasan/kasan.c:524
 __cache_free mm/slab.c:3503 [inline]
 kfree+0xca/0x250 mm/slab.c:3820
 inet_sock_destruct+0x59d/0x950 net/ipv4/af_inet.c:157
 __sk_destruct+0xfd/0x910 net/core/sock.c:1560
 sk_destruct+0x47/0x80 net/core/sock.c:1595
 __sk_free+0x57/0x230 net/core/sock.c:1603
 sk_free+0x2a/0x40 net/core/sock.c:1614
 sock_put include/net/sock.h:1652 [inline]
 inet_csk_complete_hashdance+0xd5/0xf0 net/ipv4/inet_connection_sock.c:959
 tcp_check_req+0xf4d/0x1620 net/ipv4/tcp_minisocks.c:765
 tcp_v4_rcv+0x17f6/0x2f80 net/ipv4/tcp_ipv4.c:1675
 ip_local_deliver_finish+0x2e2/0xba0 net/ipv4/ip_input.c:216
 NF_HOOK include/linux/netfilter.h:249 [inline]
 ip_local_deliver+0x1ce/0x6e0 net/ipv4/ip_input.c:257
 dst_input include/net/dst.h:464 [inline]
 ip_rcv_finish+0x887/0x19a0 net/ipv4/ip_input.c:397
 NF_HOOK include/linux/netfilter.h:249 [inline]
 ip_rcv+0xc3f/0x1820 net/ipv4/ip_input.c:493
 __netif_receive_skb_core+0x1a3e/0x34b0 net/core/dev.c:4476
 __netif_receive_skb+0x2c/0x1b0 net/core/dev.c:4514
 netif_receive_skb_internal+0x10b/0x670 net/core/dev.c:4587
 netif_receive_skb+0xae/0x390 net/core/dev.c:4611
 tun_rx_batched.isra.50+0x5ed/0x860 drivers/net/tun.c:1372
 tun_get_user+0x249c/0x36d0 drivers/net/tun.c:1766
 tun_chr_write_iter+0xbf/0x160 drivers/net/tun.c:1792
 call_write_iter include/linux/fs.h:1770 [inline]
 new_sync_write fs/read_write.c:468 [inline]
 __vfs_write+0x68a/0x970 fs/read_write.c:481
 vfs_write+0x18f/0x510 fs/read_write.c:543
 SYSC_write fs/read_write.c:588 [inline]
 SyS_write+0xef/0x220 fs/read_write.c:580
 entry_SYSCALL_64_fastpath+0x1f/0xbe

Fixes: e994b2f0fb ("tcp: do not lock listener to process SYN packets")
Fixes: 079096f103 ("tcp/dccp: install syn_recv requests into ehash table")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2017-11-18 11:11:06 +01:00
Greg Kroah-Hartman
610af855d9 This is the 4.4.85 stable release
-----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCAAdFiEEZH8oZUiU471FcZm+ONu9yGCSaT4FAlmmdSUACgkQONu9yGCS
 aT4lHg/7BJMLfX+Cu7XVaZgxNFym3gdh6+AnsSvqGqenbjRirCeh+bdK4u6iNM8v
 h8rGYyp92rYJ168piFxdsRoAl2u4dZBpczOqhpEkwFDx8tI+/B+icWeILI4SX0N2
 QWhim6tTTWy2Thw862M7lh5aJl2GxwJtxi/RXXzHq4u4w0NKPFUb+AfXEmUHDoXB
 Q6Hz8mo6dcjsW5gyNsBvsYQwvqHpB935Ok2Juz7dwarHx7CWJ+v2fqk9cIf3Nll8
 Ia04sg1HCRTePyWD0yld6jCpL51X2ZMVLa37RZCw/9WEDotFdVQO5NUg2ryCQQzN
 hNmoiJ47QLBXbZR2rQn5XEtSfWZtplOnm0tB+UYRvxJxtxJGzGTdwUNFdu4iBG4+
 xDSXbchTfyH7x93TxsvSZ+PS1NfFblYX8HETvoI2MO8PrGDdeHBZllVfF32xcK3L
 VyU+wA1L3quPk0h3MvaFXwoOW8gUAIUyQZEXGXOWTMFDCz88UeBbvPkRAfkyIeYs
 UhN8mlnM5cHhC3pPyQKFJ3kTFdQ6pZ79KLNqhvmordvfXBjTZwPt0zNYOlZKWTQR
 49WFvxEGH4B68TVc2D4mHGbciqtb+GoTQx4w3HsmyS6FF3hzPqR0L4UOvhiMaDVe
 kumziwhF9C6viis7dRlgXyJ5iydUJIcD5mJydfuPT2XIkG85eiU=
 =SWxy
 -----END PGP SIGNATURE-----

Merge 4.4.85 into android-4.4

Changes in 4.4.85
	af_key: do not use GFP_KERNEL in atomic contexts
	dccp: purge write queue in dccp_destroy_sock()
	dccp: defer ccid_hc_tx_delete() at dismantle time
	ipv4: fix NULL dereference in free_fib_info_rcu()
	net_sched/sfq: update hierarchical backlog when drop packet
	ipv4: better IP_MAX_MTU enforcement
	sctp: fully initialize the IPv6 address in sctp_v6_to_addr()
	tipc: fix use-after-free
	ipv6: reset fn->rr_ptr when replacing route
	ipv6: repair fib6 tree in failure case
	tcp: when rearming RTO, if RTO time is in past then fire RTO ASAP
	irda: do not leak initialized list.dev to userspace
	net: sched: fix NULL pointer dereference when action calls some targets
	net_sched: fix order of queue length updates in qdisc_replace()
	mei: me: add broxton pci device ids
	mei: me: add lewisburg device ids
	Input: trackpoint - add new trackpoint firmware ID
	Input: elan_i2c - add ELAN0602 ACPI ID to support Lenovo Yoga310
	ALSA: core: Fix unexpected error at replacing user TLV
	ALSA: hda - Add stereo mic quirk for Lenovo G50-70 (17aa:3978)
	ARCv2: PAE40: Explicitly set MSB counterpart of SLC region ops addresses
	i2c: designware: Fix system suspend
	drm: Release driver tracking before making the object available again
	drm/atomic: If the atomic check fails, return its value first
	drm: rcar-du: lvds: Fix PLL frequency-related configuration
	drm: rcar-du: lvds: Rename PLLEN bit to PLLON
	drm: rcar-du: Fix crash in encoder failure error path
	drm: rcar-du: Fix display timing controller parameter
	drm: rcar-du: Fix H/V sync signal polarity configuration
	tracing: Fix freeing of filter in create_filter() when set_str is false
	cifs: Fix df output for users with quota limits
	cifs: return ENAMETOOLONG for overlong names in cifs_open()/cifs_lookup()
	nfsd: Limit end of page list when decoding NFSv4 WRITE
	perf/core: Fix group {cpu,task} validation
	Bluetooth: hidp: fix possible might sleep error in hidp_session_thread
	Bluetooth: cmtp: fix possible might sleep error in cmtp_session
	Bluetooth: bnep: fix possible might sleep error in bnep_session
	binder: use group leader instead of open thread
	binder: Use wake up hint for synchronous transactions.
	ANDROID: binder: fix proc->tsk check.
	iio: imu: adis16480: Fix acceleration scale factor for adis16480
	iio: hid-sensor-trigger: Fix the race with user space powering up sensors
	staging: rtl8188eu: add RNX-N150NUB support
	ASoC: simple-card: don't fail if sysclk setting is not supported
	ASoC: rsnd: disable SRC.out only when stop timing
	ASoC: rsnd: avoid pointless loop in rsnd_mod_interrupt()
	ASoC: rsnd: Add missing initialization of ADG req_rate
	ASoC: rsnd: ssi: 24bit data needs right-aligned settings
	ASoC: rsnd: don't call update callback if it was NULL
	ntb_transport: fix qp count bug
	ntb_transport: fix bug calculating num_qps_mw
	ACPI: ioapic: Clear on-stack resource before using it
	ACPI / APEI: Add missing synchronize_rcu() on NOTIFY_SCI removal
	Linux 4.4.85

Signed-off-by: Greg Kroah-Hartman <gregkh@google.com>
2017-08-30 14:35:43 +02:00
Neal Cardwell
4e39b7409f tcp: when rearming RTO, if RTO time is in past then fire RTO ASAP
[ Upstream commit cdbeb633ca71a02b7b63bfeb94994bf4e1a0b894 ]

In some situations tcp_send_loss_probe() can realize that it's unable
to send a loss probe (TLP), and falls back to calling tcp_rearm_rto()
to schedule an RTO timer. In such cases, sometimes tcp_rearm_rto()
realizes that the RTO was eligible to fire immediately or at some
point in the past (delta_us <= 0). Previously in such cases
tcp_rearm_rto() was scheduling such "overdue" RTOs to happen at now +
icsk_rto, which caused needless delays of hundreds of milliseconds
(and non-linear behavior that made reproducible testing
difficult). This commit changes the logic to schedule "overdue" RTOs
ASAP, rather than at now + icsk_rto.

Fixes: 6ba8a3b19e ("tcp: Tail loss probe (TLP)")
Suggested-by: Yuchung Cheng <ycheng@google.com>
Signed-off-by: Neal Cardwell <ncardwell@google.com>
Signed-off-by: Yuchung Cheng <ycheng@google.com>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2017-08-30 10:19:21 +02:00
Greg Kroah-Hartman
4b8fc9f2bc This is the 4.4.82 stable release
-----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCAAdFiEEZH8oZUiU471FcZm+ONu9yGCSaT4FAlmPuZcACgkQONu9yGCS
 aT5o0BAAlT21EbhyxoMPC6xrPHAF1Oi8mTVfpu+618AUs3B1M6xge/EKI08B/8DP
 MZgaqSqY5ttaIlDKX5OVhY+HiuMg3SbIaFDzhS+OzjpuIjSA9ljNHazp5/l2HsQu
 9zyPFN02L2zqWYppyDo6FQBfStB5rUHB4eVMgD6zuNU/YQovtibGqAY4LBfWvxf/
 eDO6VfjiS4zzcCoplZxcxim1YVZ+HX09BuwniJzukM4C4/uMDubwMlJmrN9YsQZW
 x5zWnLHce2MATk9yF4BzMI/iRDR+Bm6Vx3m1Vzq9WDu7/kkMTVdYjXHmZn02YQub
 q4q1nDZyzBf8SvE4kf+fMYS8+dUrwiKf0lahBTK31J5Bc33lRfBfvv+dr/aEnp/Q
 FhraSkcBDrulnxuq77WZbvWzj0otF1pCTtURyCSfdc4SOFwVIz2NLQ2ZnnXk4gnN
 h5TqjxSDwr2CwTMzOnaKjBcuWnKPvn3/Pjm+/MJS8wvQYPZv8a4AzMIwxjDEN78Z
 +FvtaWEoUCnlP869hyR7gTfk2541+qjMdDRRUPSQ16PvepKy1AG9iCqVvZThScyQ
 PygaiBYZ9pbcyFuExLQrj2FDY2odinPfN8IsCQQbk5Es5mCdzJZOOkLeO2PO0MxD
 Dya79igFnpNj7ZEu6T7lD6Izg/6fYWu7qKmpDKQ7/xn4hHxj/Ig=
 =D9TG
 -----END PGP SIGNATURE-----

Merge 4.4.82 into android-4.4

Changes in 4.4.82
	tcp: avoid setting cwnd to invalid ssthresh after cwnd reduction states
	net: fix keepalive code vs TCP_FASTOPEN_CONNECT
	bpf, s390: fix jit branch offset related to ldimm64
	net: sched: set xt_tgchk_param par.nft_compat as 0 in ipt_init_target
	tcp: fastopen: tcp_connect() must refresh the route
	net: avoid skb_warn_bad_offload false positives on UFO
	packet: fix tp_reserve race in packet_set_ring
	revert "net: account for current skb length when deciding about UFO"
	revert "ipv4: Should use consistent conditional judgement for ip fragment in __ip_append_data and ip_finish_output"
	udp: consistently apply ufo or fragmentation
	sparc64: Prevent perf from running during super critical sections
	KVM: arm/arm64: Handle hva aging while destroying the vm
	mm/mempool: avoid KASAN marking mempool poison checks as use-after-free
	ipv4: Should use consistent conditional judgement for ip fragment in __ip_append_data and ip_finish_output
	net: account for current skb length when deciding about UFO
	Linux 4.4.82

Signed-off-by: Greg Kroah-Hartman <gregkh@google.com>
2017-08-14 10:17:08 -07:00
Yuchung Cheng
025bb7f7e9 tcp: avoid setting cwnd to invalid ssthresh after cwnd reduction states
[ Upstream commit ed254971edea92c3ac5c67c6a05247a92aa6075e ]

If the sender switches the congestion control during ECN-triggered
cwnd-reduction state (CA_CWR), upon exiting recovery cwnd is set to
the ssthresh value calculated by the previous congestion control. If
the previous congestion control is BBR that always keep ssthresh
to TCP_INIFINITE_SSTHRESH, cwnd ends up being infinite. The safe
step is to avoid assigning invalid ssthresh value when recovery ends.

Signed-off-by: Yuchung Cheng <ycheng@google.com>
Signed-off-by: Neal Cardwell <ncardwell@google.com>
Acked-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2017-08-12 19:29:08 -07:00
Greg Kroah-Hartman
cc3d2b7361 This is the 4.4.77 stable release
-----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCAAdFiEEZH8oZUiU471FcZm+ONu9yGCSaT4FAllp5y8ACgkQONu9yGCS
 aT6g8hAApzYi9TwiaF6wyYXsrp7YvOm4NyMaVBl4t7v/nFql6VsUL+qWaJKB5EL9
 o72ybYPUzbxGTVWCm/wiBO31VWea0ak0pBbyywBiowGgwAcgG/jpqZobale4Y2TE
 15jEpmpA5+3BmXpMkrv/dz4LHZ4jm65/ADhMbkPGRZqUJ3mHmyVoi50l67dpTE5+
 xWQIErycwlVMppJGnXPeFFgeD7Etch7OJ9CishQRNMb3F8H58WiQrMWWe1NfL0DO
 H2g18IBHMsxEYJqnRqxviTOMe8S96Km+lKGX0LOTRYt+2OQpfIF7buU6N+6C96rO
 7V2n2G02m2mOFVUFlDYF1RQ9IBrxHJf9iGkaZBwsaxX7XAK63ZjRxgjnEL7gMPU/
 TMCOWZ53BdZezz2eAmdhySsV+4Xt6MmJJE8rR47AgsM2Le3tgK421zmraunmA0fR
 eoJS99YHcftAHXCD3puGLafEwGVe0G4eQbY4L7mj1Y9VjaAbmmsWq9rlNOQMZRgH
 JTNyYik1C7yGPJX1iKi9hLAKldzBwPuM3GfZMZQIOjA4t2VtSon7in5iKrihRg3N
 BSKXr6+orNw32tsqcC4kpLPbFUFb6zx3EKELwSJwD9ICN7swJEk7gFw7w/F/SOxI
 C1W4Ulm6EcYTWHDePERQ4zHlllHAmyJup61d9HnwA6HhPOLaff4=
 =oeNk
 -----END PGP SIGNATURE-----

Merge 4.4.77 into android-4.4

Changes in 4.4.77
	fs: add a VALID_OPEN_FLAGS
	fs: completely ignore unknown open flags
	driver core: platform: fix race condition with driver_override
	bgmac: reset & enable Ethernet core before using it
	mm: fix classzone_idx underflow in shrink_zones()
	tracing/kprobes: Allow to create probe with a module name starting with a digit
	drm/virtio: don't leak bo on drm_gem_object_init failure
	usb: dwc3: replace %p with %pK
	USB: serial: cp210x: add ID for CEL EM3588 USB ZigBee stick
	Add USB quirk for HVR-950q to avoid intermittent device resets
	usb: usbip: set buffer pointers to NULL after free
	usb: Fix typo in the definition of Endpoint[out]Request
	mac80211_hwsim: Replace bogus hrtimer clockid
	sysctl: don't print negative flag for proc_douintvec
	sysctl: report EINVAL if value is larger than UINT_MAX for proc_douintvec
	pinctrl: sh-pfc: r8a7791: Fix SCIF2 pinmux data
	pinctrl: meson: meson8b: fix the NAND DQS pins
	pinctrl: sunxi: Fix SPDIF function name for A83T
	pinctrl: mxs: atomically switch mux and drive strength config
	pinctrl: sh-pfc: Update info pointer after SoC-specific init
	USB: serial: option: add two Longcheer device ids
	USB: serial: qcserial: new Sierra Wireless EM7305 device ID
	gfs2: Fix glock rhashtable rcu bug
	x86/tools: Fix gcc-7 warning in relocs.c
	x86/uaccess: Optimize copy_user_enhanced_fast_string() for short strings
	ath10k: override CE5 config for QCA9377
	KEYS: Fix an error code in request_master_key()
	RDMA/uverbs: Check port number supplied by user verbs cmds
	mqueue: fix a use-after-free in sys_mq_notify()
	tools include: Add a __fallthrough statement
	tools string: Use __fallthrough in perf_atoll()
	tools strfilter: Use __fallthrough
	perf top: Use __fallthrough
	perf intel-pt: Use __fallthrough
	perf thread_map: Correctly size buffer used with dirent->dt_name
	perf scripting perl: Fix compile error with some perl5 versions
	perf tests: Avoid possible truncation with dirent->d_name + snprintf
	perf bench numa: Avoid possible truncation when using snprintf()
	perf tools: Use readdir() instead of deprecated readdir_r()
	perf thread_map: Use readdir() instead of deprecated readdir_r()
	perf script: Use readdir() instead of deprecated readdir_r()
	perf tools: Remove duplicate const qualifier
	perf annotate browser: Fix behaviour of Shift-Tab with nothing focussed
	perf pmu: Fix misleadingly indented assignment (whitespace)
	perf dwarf: Guard !x86_64 definitions under #ifdef else clause
	perf trace: Do not process PERF_RECORD_LOST twice
	perf tests: Remove wrong semicolon in while loop in CQM test
	perf tools: Use readdir() instead of deprecated readdir_r() again
	md: fix incorrect use of lexx_to_cpu in does_sb_need_changing
	md: fix super_offset endianness in super_1_rdev_size_change
	tcp: fix tcp_mark_head_lost to check skb len before fragmenting
	staging: vt6556: vnt_start Fix missing call to vnt_key_init_table.
	staging: comedi: fix clean-up of comedi_class in comedi_init()
	ext4: check return value of kstrtoull correctly in reserved_clusters_store
	x86/mm/pat: Don't report PAT on CPUs that don't support it
	saa7134: fix warm Medion 7134 EEPROM read
	Linux 4.4.77

Signed-off-by: Greg Kroah-Hartman <gregkh@google.com>
2017-07-15 13:29:08 +02:00
Neal Cardwell
627f3abeea tcp: fix tcp_mark_head_lost to check skb len before fragmenting
commit d88270eef4b56bd7973841dd1fed387ccfa83709 upstream.

This commit fixes a corner case in tcp_mark_head_lost() which was
causing the WARN_ON(len > skb->len) in tcp_fragment() to fire.

tcp_mark_head_lost() was assuming that if a packet has
tcp_skb_pcount(skb) of N, then it's safe to fragment off a prefix of
M*mss bytes, for any M < N. But with the tricky way TCP pcounts are
maintained, this is not always true.

For example, suppose the sender sends 4 1-byte packets and have the
last 3 packet sacked. It will merge the last 3 packets in the write
queue into an skb with pcount = 3 and len = 3 bytes. If another
recovery happens after a sack reneging event, tcp_mark_head_lost()
may attempt to split the skb assuming it has more than 2*MSS bytes.

This sounds very counterintuitive, but as the commit description for
the related commit c0638c247f ("tcp: don't fragment SACKed skbs in
tcp_mark_head_lost()") notes, this is because tcp_shifted_skb()
coalesces adjacent regions of SACKed skbs, and when doing this it
preserves the sum of their packet counts in order to reflect the
real-world dynamics on the wire. The c0638c247f commit tried to
avoid problems by not fragmenting SACKed skbs, since SACKed skbs are
where the non-proportionality between pcount and skb->len/mss is known
to be possible. However, that commit did not handle the case where
during a reneging event one of these weird SACKed skbs becomes an
un-SACKed skb, which tcp_mark_head_lost() can then try to fragment.

The fix is to simply mark the entire skb lost when this happens.
This makes the recovery slightly more aggressive in such corner
cases before we detect reordering. But once we detect reordering
this code path is by-passed because FACK is disabled.

Signed-off-by: Neal Cardwell <ncardwell@google.com>
Signed-off-by: Yuchung Cheng <ycheng@google.com>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Cc: Vinson Lee <vlee@freedesktop.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2017-07-15 11:57:49 +02:00
Greg Kroah-Hartman
6fc0573f6d This is the 4.4.71 stable release
-----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCAAdFiEEZH8oZUiU471FcZm+ONu9yGCSaT4FAlk30BwACgkQONu9yGCS
 aT5cmhAAh3etTuZ3xRw2eGW/Y/C8L2F2CjJjmR4vp1ms8P55uZg3xA20r5jNj7Ho
 pwag3WTNzHpVfKFApavfEzToqDszRAtXcvYPPW9uXUPeu8LWyBJyvmN7lSQVKgDc
 M9SWsd+8EGceopaj8KHjLMxNsV2n8j2ckxNf/BL/KgiMtJlgp/1TCDKUVS1k0cA7
 CsuxDhxpRYpQofsIVww1hdrwCxVuntAY7u+/3B19ozXGFSRe/h5GO6xYRcG8pqfT
 lvIgD6btdQJwI55QoSpJCpL96a534zc+akO0dtyaMJ3Q8UWQXD3JF8ZxMiPPrAe8
 CLW390ATranIafmLi9g9DU1vQeEPNFXpeiYfxe65YL7igeAj/uPtVzKp0MvRcKG7
 IBVNxbtsTQa73ig7gKSJ323CnpEfrr/XG73JNVtUQLxHa2poY7SUonRI587MFW2T
 sONl9Pk3TxRC7Rc45si4RFsIj4jEF8ubUDXOPb2CrmDMB7MrM0PHfOW9lLCP92FD
 pn0fM4vwNvm2ILsblqNcBumgeIBQ8ld2TBTbhRbh2FK4Rzxd2TSlWh4KqkcWcXCt
 Lz8conU06AwTvDob1xoht3m6Gj32maopKZKGn5/Wq0YlfjOB/70CXOvPO3ChhKTh
 QGNgA66bYdm+xn55wf7ty7Bq8yO6kcSNPQCXOb9S61nfCLA4KHM=
 =U7IH
 -----END PGP SIGNATURE-----

Merge 4.4.71 into android-4.4

Changes in 4.4.71
	sparc: Fix -Wstringop-overflow warning
	dccp/tcp: do not inherit mc_list from parent
	ipv6/dccp: do not inherit ipv6_mc_list from parent
	s390/qeth: handle sysfs error during initialization
	s390/qeth: unbreak OSM and OSN support
	s390/qeth: avoid null pointer dereference on OSN
	tcp: avoid fragmenting peculiar skbs in SACK
	sctp: fix src address selection if using secondary addresses for ipv6
	sctp: do not inherit ipv6_{mc|ac|fl}_list from parent
	tcp: eliminate negative reordering in tcp_clean_rtx_queue
	net: Improve handling of failures on link and route dumps
	ipv6: Prevent overrun when parsing v6 header options
	ipv6: Check ip6_find_1stfragopt() return value properly.
	bridge: netlink: check vlan_default_pvid range
	qmi_wwan: add another Lenovo EM74xx device ID
	bridge: start hello_timer when enabling KERNEL_STP in br_stp_start
	ipv6: fix out of bound writes in __ip6_append_data()
	be2net: Fix offload features for Q-in-Q packets
	virtio-net: enable TSO/checksum offloads for Q-in-Q vlans
	tcp: avoid fastopen API to be used on AF_UNSPEC
	sctp: fix ICMP processing if skb is non-linear
	ipv4: add reference counting to metrics
	netem: fix skb_orphan_partial()
	net: phy: marvell: Limit errata to 88m1101
	vlan: Fix tcp checksum offloads in Q-in-Q vlans
	i2c: i2c-tiny-usb: fix buffer not being DMA capable
	mmc: sdhci-iproc: suppress spurious interrupt with Multiblock read
	HID: wacom: Have wacom_tpc_irq guard against possible NULL dereference
	scsi: mpt3sas: Force request partial completion alignment
	drm/radeon/ci: disable mclk switching for high refresh rates (v2)
	drm/radeon: Unbreak HPD handling for r600+
	pcmcia: remove left-over %Z format
	ALSA: hda - apply STAC_9200_DELL_M22 quirk for Dell Latitude D430
	slub/memcg: cure the brainless abuse of sysfs attributes
	drm/gma500/psb: Actually use VBT mode when it is found
	mm/migrate: fix refcount handling when !hugepage_migration_supported()
	mlock: fix mlock count can not decrease in race condition
	xfs: Fix missed holes in SEEK_HOLE implementation
	xfs: fix off-by-one on max nr_pages in xfs_find_get_desired_pgoff()
	xfs: fix over-copying of getbmap parameters from userspace
	xfs: handle array index overrun in xfs_dir2_leaf_readbuf()
	xfs: prevent multi-fsb dir readahead from reading random blocks
	xfs: fix up quotacheck buffer list error handling
	xfs: support ability to wait on new inodes
	xfs: update ag iterator to support wait on new inodes
	xfs: wait on new inodes during quotaoff dquot release
	xfs: fix indlen accounting error on partial delalloc conversion
	xfs: bad assertion for delalloc an extent that start at i_size
	xfs: fix unaligned access in xfs_btree_visit_blocks
	xfs: in _attrlist_by_handle, copy the cursor back to userspace
	xfs: only return -errno or success from attr ->put_listent
	Linux 4.4.71

Signed-off-by: Greg Kroah-Hartman <gregkh@google.com>
2017-06-07 12:36:01 +02:00
Soheil Hassas Yeganeh
7ede5c90fc tcp: eliminate negative reordering in tcp_clean_rtx_queue
[ Upstream commit bafbb9c73241760023d8981191ddd30bb1c6dbac ]

tcp_ack() can call tcp_fragment() which may dededuct the
value tp->fackets_out when MSS changes. When prior_fackets
is larger than tp->fackets_out, tcp_clean_rtx_queue() can
invoke tcp_update_reordering() with negative values. This
results in absurd tp->reodering values higher than
sysctl_tcp_max_reordering.

Note that tcp_update_reordering indeeds sets tp->reordering
to min(sysctl_tcp_max_reordering, metric), but because
the comparison is signed, a negative metric always wins.

Fixes: c7caf8d3ed ("[TCP]: Fix reord detection due to snd_una covered holes")
Reported-by: Rebecca Isaacs <risaacs@google.com>
Signed-off-by: Soheil Hassas Yeganeh <soheil@google.com>
Signed-off-by: Neal Cardwell <ncardwell@google.com>
Signed-off-by: Yuchung Cheng <ycheng@google.com>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2017-06-07 12:05:57 +02:00
Yuchung Cheng
90e3f8a558 tcp: avoid fragmenting peculiar skbs in SACK
[ Upstream commit b451e5d24ba6687c6f0e7319c727a709a1846c06 ]

This patch fixes a bug in splitting an SKB during SACK
processing. Specifically if an skb contains multiple
packets and is only partially sacked in the higher sequences,
tcp_match_sack_to_skb() splits the skb and marks the second fragment
as SACKed.

The current code further attempts rounding up the first fragment
to MSS boundaries. But it misses a boundary condition when the
rounded-up fragment size (pkt_len) is exactly skb size.  Spliting
such an skb is pointless and causses a kernel warning and aborts
the SACK processing. This patch universally checks such over-split
before calling tcp_fragment to prevent these unnecessary warnings.

Fixes: adb92db857 ("tcp: Make SACK code to split only at mss boundaries")
Signed-off-by: Yuchung Cheng <ycheng@google.com>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: Soheil Hassas Yeganeh <soheil@google.com>
Acked-by: Neal Cardwell <ncardwell@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2017-06-07 12:05:57 +02:00
Greg Kroah-Hartman
29950430ce This is the 4.4.58 stable release
-----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCAAdFiEEZH8oZUiU471FcZm+ONu9yGCSaT4FAljctYYACgkQONu9yGCS
 aT6MbxAAwyGobI5sOr63yX5Myji1jf17vlY2h5dXet+8lu/csbFKmqHxaTLNwGIw
 7u6V3AJ4zWdX8Q22dcVC98oySxcLxUhv+Rv/Dbonr3CYM00wNIex2wzON8f77eJQ
 CBGJRNJR4/VG6opbVI/qp0t/c2oFiqHJXPldm3/Ru7jcBrLo5UHDWDY6cDhrj/Tg
 F1maCBMAu1qW0z9KTnrQDvHjPHXmKfCviGzXpFTSVBQrh1s1bJkZkTqcY9eZNa/u
 AXhHek5ZLFxlhkO105leR0YtXADbopiJ5c4EgXCASzNQ92/6IKsl21eOhHgOU9OA
 YUCYftwKVMcxXGB6QFbdefLVtnCjUtlDa9+70oW1/4Ecee5FUBzNjVWVHOtYEsTY
 pA3DqQI+U7EBCuIXsTtV0DiRWhHKq5uS1aphXZwnq/8qc2A3PD86JV/MBK5sWZfB
 2V1N7xkitLFFCR6vMFLuusM8Np7kJ3zaAxQOd3IRc72iiNLkbNjdfJcAQ+E9b/Zx
 5tpcthOl2RKhlOHHVKmYIioar8+RkZgWWl64+RTt6M1KjvHs07lPdCI+4cW8ELLM
 /FUeRNTLmOiUv4dEPj5INYukEcLuCNp4fIo9lq8HrDceXbJXwiMFtHHlo4SQ/ubm
 9v5iYdEmGet8jYPrfa5LDtD7G8K//k08nElVmI6N/4fNxRC2GcY=
 =TI1/
 -----END PGP SIGNATURE-----

Merge 4.4.48 into android-4.4

Changes in 4.4.48:
	net/openvswitch: Set the ipv6 source tunnel key address attribute correctly
	net: bcmgenet: Do not suspend PHY if Wake-on-LAN is enabled
	net: properly release sk_frag.page
	amd-xgbe: Fix jumbo MTU processing on newer hardware
	net: unix: properly re-increment inflight counter of GC discarded candidates
	net/mlx5: Increase number of max QPs in default profile
	net/mlx5e: Count LRO packets correctly
	net: bcmgenet: remove bcmgenet_internal_phy_setup()
	ipv4: provide stronger user input validation in nl_fib_input()
	socket, bpf: fix sk_filter use after free in sk_clone_lock
	tcp: initialize icsk_ack.lrcvtime at session start time
	Input: elan_i2c - add ASUS EeeBook X205TA special touchpad fw
	Input: i8042 - add noloop quirk for Dell Embedded Box PC 3000
	Input: iforce - validate number of endpoints before using them
	Input: ims-pcu - validate number of endpoints before using them
	Input: hanwang - validate number of endpoints before using them
	Input: yealink - validate number of endpoints before using them
	Input: cm109 - validate number of endpoints before using them
	Input: kbtab - validate number of endpoints before using them
	Input: sur40 - validate number of endpoints before using them
	ALSA: seq: Fix racy cell insertions during snd_seq_pool_done()
	ALSA: ctxfi: Fix the incorrect check of dma_set_mask() call
	ALSA: hda - Adding a group of pin definition to fix headset problem
	USB: serial: option: add Quectel UC15, UC20, EC21, and EC25 modems
	USB: serial: qcserial: add Dell DW5811e
	ACM gadget: fix endianness in notifications
	usb: gadget: f_uvc: Fix SuperSpeed companion descriptor's wBytesPerInterval
	usb-core: Add LINEAR_FRAME_INTR_BINTERVAL USB quirk
	USB: uss720: fix NULL-deref at probe
	USB: lvtest: fix NULL-deref at probe
	USB: idmouse: fix NULL-deref at probe
	USB: wusbcore: fix NULL-deref at probe
	usb: musb: cppi41: don't check early-TX-interrupt for Isoch transfer
	usb: hub: Fix crash after failure to read BOS descriptor
	uwb: i1480-dfu: fix NULL-deref at probe
	uwb: hwa-rc: fix NULL-deref at probe
	mmc: ushc: fix NULL-deref at probe
	iio: adc: ti_am335x_adc: fix fifo overrun recovery
	iio: hid-sensor-trigger: Change get poll value function order to avoid sensor properties losing after resume from S3
	parport: fix attempt to write duplicate procfiles
	ext4: mark inode dirty after converting inline directory
	mmc: sdhci: Do not disable interrupts while waiting for clock
	xen/acpi: upload PM state from init-domain to Xen
	iommu/vt-d: Fix NULL pointer dereference in device_to_iommu
	ARM: at91: pm: cpu_idle: switch DDR to power-down mode
	ARM: dts: at91: sama5d2: add dma properties to UART nodes
	cpufreq: Restore policy min/max limits on CPU online
	raid10: increment write counter after bio is split
	libceph: don't set weight to IN when OSD is destroyed
	xfs: don't allow di_size with high bit set
	xfs: fix up xfs_swap_extent_forks inline extent handling
	nl80211: fix dumpit error path RTNL deadlocks
	USB: usbtmc: add missing endpoint sanity check
	xfs: clear _XBF_PAGES from buffers when readahead page
	xen: do not re-use pirq number cached in pci device msi msg data
	igb: Workaround for igb i210 firmware issue
	igb: add i211 to i210 PHY workaround
	x86/hyperv: Handle unknown NMIs on one CPU when unknown_nmi_panic
	PCI: Separate VF BAR updates from standard BAR updates
	PCI: Remove pci_resource_bar() and pci_iov_resource_bar()
	PCI: Add comments about ROM BAR updating
	PCI: Decouple IORESOURCE_ROM_ENABLE and PCI_ROM_ADDRESS_ENABLE
	PCI: Don't update VF BARs while VF memory space is enabled
	PCI: Update BARs using property bits appropriate for type
	PCI: Ignore BAR updates on virtual functions
	PCI: Do any VF BAR updates before enabling the BARs
	vfio/spapr: Postpone allocation of userspace version of TCE table
	block: allow WRITE_SAME commands with the SG_IO ioctl
	s390/zcrypt: Introduce CEX6 toleration
	uvcvideo: uvc_scan_fallback() for webcams with broken chain
	ACPI / blacklist: add _REV quirks for Dell Precision 5520 and 3520
	ACPI / blacklist: Make Dell Latitude 3350 ethernet work
	serial: 8250_pci: Detach low-level driver during PCI error recovery
	fbcon: Fix vc attr at deinit
	crypto: algif_hash - avoid zero-sized array
	Linux 4.4.58

Signed-off-by: Greg Kroah-Hartman <gregkh@google.com>
2017-03-30 13:18:20 +02:00
Eric Dumazet
afaed24192 tcp: initialize icsk_ack.lrcvtime at session start time
[ Upstream commit 15bb7745e94a665caf42bfaabf0ce062845b533b ]

icsk_ack.lrcvtime has a 0 value at socket creation time.

tcpi_last_data_recv can have bogus value if no payload is ever received.

This patch initializes icsk_ack.lrcvtime for active sessions
in tcp_finish_connect(), and for passive sessions in
tcp_create_openreq_child()

Signed-off-by: Eric Dumazet <edumazet@google.com>
Acked-by: Neal Cardwell <ncardwell@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2017-03-30 09:35:14 +02:00
Dmitry Shmidt
324e88de4a This is the 4.4.32 stable release
-----BEGIN PGP SIGNATURE-----
 
 iQIcBAABCAAGBQJYKq+PAAoJEDjbvchgkmk+W3sQAKHJ6dI10P/sFTe4AlGoRGNr
 ZtCwGwwolBoD/NtXa2HCovc9ofIU4zWYXl5P+kbHtKV/ZB4q5+m7Q5bpWh4TQFUy
 9TKho6aywF9uXpAEV99qKYvAOIq5EgJXdgrhCRTYBBR9+uR3+B1cUJhxpyD6htw4
 H7ABpmihWjij0o9YYAin7y/O+8jeqnuNLPUoCek1Emf0cn7G5keMg8Lli0WCz7jM
 JdKOjbvaYscgvb4BqTKqtg5NneC3GoeNp43Kvz4LbmcPw1yT5N8sHswqlSio4U2U
 Sxyvtj0RxoSoAus2UR62pTGDu1TrSHxWEWpYpqa77hr1/TpBY7put1OldFmUfu1B
 voQUI05Ox74RT9pl5c8DGnXH8Zyiu6a7Fpj6EdWbWxtbIgvWCLaDHniEY1WKR6cj
 Bmil/zjGyDtzANJBasC9NJHF8yd+/vxNfn5n0eAz6Xp94MIdOGPIQle+NATG5osN
 0b/NLit64B2F6Djijkv1vV9V7x1oYqIYVG6f1BoVtRXCjhcx9PnkskXcP+1SKUhH
 xOTXLt6rGNaTj+T2/41VJUtZ6eiZj+0GZMXILu5SIEdKiRiGLfsLHX117OK3ZhYT
 PFzzzWZoC2FOL/ldp/K6ncPZV0oHn3yfQa3T97jGI1LbsYkXXyQkW5PNwqGccbUc
 xvhEAPDvBxDlfcgqWMaw
 =DC+B
 -----END PGP SIGNATURE-----

Merge tag 'v4.4.32' into android-4.4.y

This is the 4.4.32 stable release

Change-Id: I5028402eadfcf055ac44a5e67abc6da75b2068b3
2016-11-15 17:02:38 -08:00
Eric Dumazet
aadcd6a960 tcp: fix a compile error in DBGUNDO()
[ Upstream commit 019b1c9fe32a2a32c1153e31375f87ec3e591273 ]

If DBGUNDO() is enabled (FASTRETRANS_DEBUG > 1), a compile
error will happen, since inet6_sk(sk)->daddr became sk->sk_v6_daddr

Fixes: efe4208f47 ("ipv6: make lookups simpler and faster")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2016-11-15 07:46:36 +01:00
Dmitry Shmidt
aa349c0a96 This is the 4.4.19 stable release
-----BEGIN PGP SIGNATURE-----
 Version: GnuPG v2
 
 iQIcBAABCAAGBQJXuIDJAAoJEDjbvchgkmk+i10QALySg/PFXDJ6AwUskGbetHBz
 RnsJ8WzjtzBR5vAyaru2vkD/GhFmM3ziG8guQK3uWGhhfpB+CPJjDmYIY1O5Djma
 CviyB6UsEIuf2zN7U70WSmjJ/FyD7XRqjGnEX9u5YGS4WQTFPnPttE4HE82ErEEW
 IocnBGFZriGye9D/2O6OjTDgIusLsZ6WKawK0OyeKiUrTUsmhLBtW0nfMHd/snNw
 4Aas0j6g5tjYrNBUyKqmkYhi7S2kFyZ7QH1vqrXxUHu4CNslTa6i1VTkQ+uVxbuF
 Vw9DLP6KEmB/Q5KyIVFMmEv6E5vvgymv7rrQ4c7pu6vqmHzbdtaWxZFM18EnIXOk
 qe8/9wzF4ahw+h/0ddmjpjmWi/SRYG8PmobgTWmIqJl+SNq4VK2G/GRkWce45EDi
 lMO6UI4qUd8vMw1OJOdKwp8C/D+l5V1qrVlQTVba8IJsH2fKFw9aSKAGwpppawfl
 CiESwHhSINGfhGzDyYS/keo1JM0KDyGc3EYQG5DaSzNZu4jqkhNPjBlQEOJug3/I
 6LDrWQo4+qC6vJJ836NyRvakv1WDL8AsHmTOuiW8h8LzcGsaxac9L7HMRgwItXAs
 aWTXg2eBoJXkBQalglvhSzGqBJl2ytlu0Efxg97zEL1huZuYDdzf9tO7hqMujZhc
 k+SnQTS6JXVuDe46uDyb
 =JLSE
 -----END PGP SIGNATURE-----

Merge tag 'v4.4.19' into android-4.4.y

This is the 4.4.19 stable release
2016-08-22 14:09:08 -07:00
Jason Baron
5413f1a526 tcp: enable per-socket rate limiting of all 'challenge acks'
[ Upstream commit 083ae308280d13d187512b9babe3454342a7987e ]

The per-socket rate limit for 'challenge acks' was introduced in the
context of limiting ack loops:

commit f2b2c582e8 ("tcp: mitigate ACK loops for connections as tcp_sock")

And I think it can be extended to rate limit all 'challenge acks' on a
per-socket basis.

Since we have the global tcp_challenge_ack_limit, this patch allows for
tcp_challenge_ack_limit to be set to a large value and effectively rely on
the per-socket limit, or set tcp_challenge_ack_limit to a lower value and
still prevents a single connections from consuming the entire challenge ack
quota.

It further moves in the direction of eliminating the global limit at some
point, as Eric Dumazet has suggested. This a follow-up to:
Subject: tcp: make challenge acks less predictable

Cc: Eric Dumazet <edumazet@google.com>
Cc: David S. Miller <davem@davemloft.net>
Cc: Neal Cardwell <ncardwell@google.com>
Cc: Yuchung Cheng <ycheng@google.com>
Cc: Yue Cao <ycao009@ucr.edu>
Signed-off-by: Jason Baron <jbaron@akamai.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2016-08-16 09:30:47 +02:00
Eric Dumazet
72c2d3bcca tcp: make challenge acks less predictable
[ Upstream commit 75ff39ccc1bd5d3c455b6822ab09e533c551f758 ]

Yue Cao claims that current host rate limiting of challenge ACKS
(RFC 5961) could leak enough information to allow a patient attacker
to hijack TCP sessions. He will soon provide details in an academic
paper.

This patch increases the default limit from 100 to 1000, and adds
some randomization so that the attacker can no longer hijack
sessions without spending a considerable amount of probes.

Based on initial analysis and patch from Linus.

Note that we also have per socket rate limiting, so it is tempting
to remove the host limit in the future.

v2: randomize the count of challenge acks per second, not the period.

Fixes: 282f23c6ee ("tcp: implement RFC 5961 3.2")
Reported-by: Yue Cao <ycao009@ucr.edu>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Suggested-by: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Yuchung Cheng <ycheng@google.com>
Cc: Neal Cardwell <ncardwell@google.com>
Acked-by: Neal Cardwell <ncardwell@google.com>
Acked-by: Yuchung Cheng <ycheng@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2016-08-16 09:30:47 +02:00
JP Abgrall
72fc8653dc tcp: add a sysctl to config the tcp_default_init_rwnd
The default initial rwnd is hardcoded to 10.

Now we allow it to be controlled via
  /proc/sys/net/ipv4/tcp_default_init_rwnd
which limits the values from 3 to 100

This is somewhat needed because ipv6 routes are
autoconfigured by the kernel.

See "An Argument for Increasing TCP's Initial Congestion Window"
in https://developers.google.com/speed/articles/tcp_initcwnd_paper.pdf

Change-Id: I386b2a9d62de0ebe05c1ebe1b4bd91b314af5c54
Signed-off-by: JP Abgrall <jpa@google.com>

Conflicts:
	net/ipv4/sysctl_net_ipv4.c
	net/ipv4/tcp_input.c
2016-02-16 13:51:43 -08:00
Yuchung Cheng
8b8a321ff7 tcp: fix zero cwnd in tcp_cwnd_reduction
Patch 3759824da8 ("tcp: PRR uses CRB mode by default and SS mode
conditionally") introduced a bug that cwnd may become 0 when both
inflight and sndcnt are 0 (cwnd = inflight + sndcnt). This may lead
to a div-by-zero if the connection starts another cwnd reduction
phase by setting tp->prior_cwnd to the current cwnd (0) in
tcp_init_cwnd_reduction().

To prevent this we skip PRR operation when nothing is acked or
sacked. Then cwnd must be positive in all cases as long as ssthresh
is positive:

1) The proportional reduction mode
   inflight > ssthresh > 0

2) The reduction bound mode
  a) inflight == ssthresh > 0

  b) inflight < ssthresh
     sndcnt > 0 since newly_acked_sacked > 0 and inflight < ssthresh

Therefore in all cases inflight and sndcnt can not both be 0.
We check invalid tp->prior_cwnd to avoid potential div0 bugs.

In reality this bug is triggered only with a sequence of less common
events.  For example, the connection is terminating an ECN-triggered
cwnd reduction with an inflight 0, then it receives reordered/old
ACKs or DSACKs from prior transmission (which acks nothing). Or the
connection is in fast recovery stage that marks everything lost,
but fails to retransmit due to local issues, then receives data
packets from other end which acks nothing.

Fixes: 3759824da8 ("tcp: PRR uses CRB mode by default and SS mode conditionally")
Reported-by: Oleksandr Natalenko <oleksandr@natalenko.name>
Signed-off-by: Yuchung Cheng <ycheng@google.com>
Signed-off-by: Neal Cardwell <ncardwell@google.com>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-01-06 16:39:56 -05:00
Eric Dumazet
142a2e7ece tcp: initialize tp->copied_seq in case of cross SYN connection
Dmitry provided a syzkaller (http://github.com/google/syzkaller)
generated program that triggers the WARNING at
net/ipv4/tcp.c:1729 in tcp_recvmsg() :

WARN_ON(tp->copied_seq != tp->rcv_nxt &&
        !(flags & (MSG_PEEK | MSG_TRUNC)));

His program is specifically attempting a Cross SYN TCP exchange,
that we support (for the pleasure of hackers ?), but it looks we
lack proper tcp->copied_seq initialization.

Thanks again Dmitry for your report and testings.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Reported-by: Dmitry Vyukov <dvyukov@google.com>
Tested-by: Dmitry Vyukov <dvyukov@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-11-30 15:34:17 -05:00
Eric Dumazet
5d4c9bfbab tcp: fix potential huge kmalloc() calls in TCP_REPAIR
tcp_send_rcvq() is used for re-injecting data into tcp receive queue.

Problems :

- No check against size is performed, allowed user to fool kernel in
  attempting very large memory allocations, eventually triggering
  OOM when memory is fragmented.

- In case of fault during the copy we do not return correct errno.

Lets use alloc_skb_with_frags() to cook optimal skbs.

Fixes: 292e8d8c85 ("tcp: Move rcvq sending to tcp_input.c")
Fixes: c0e88ff0f2 ("tcp: Repair socket queues")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Pavel Emelyanov <xemul@parallels.com>
Acked-by: Pavel Emelyanov <xemul@parallels.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-11-20 10:57:33 -05:00
Yuchung Cheng
4f41b1c58a tcp: use RACK to detect losses
This patch implements the second half of RACK that uses the the most
recent transmit time among all delivered packets to detect losses.

tcp_rack_mark_lost() is called upon receiving a dubious ACK.
It then checks if an not-yet-sacked packet was sent at least
"reo_wnd" prior to the sent time of the most recently delivered.
If so the packet is deemed lost.

The "reo_wnd" reordering window starts with 1msec for fast loss
detection and changes to min-RTT/4 when reordering is observed.
We found 1msec accommodates well on tiny degree of reordering
(<3 pkts) on faster links. We use min-RTT instead of SRTT because
reordering is more of a path property but SRTT can be inflated by
self-inflicated congestion. The factor of 4 is borrowed from the
delayed early retransmit and seems to work reasonably well.

Since RACK is still experimental, it is now used as a supplemental
loss detection on top of existing algorithms. It is only effective
after the fast recovery starts or after the timeout occurs. The
fast recovery is still triggered by FACK and/or dupack threshold
instead of RACK.

We introduce a new sysctl net.ipv4.tcp_recovery for future
experiments of loss recoveries. For now RACK can be disabled by
setting it to 0.

Signed-off-by: Yuchung Cheng <ycheng@google.com>
Signed-off-by: Neal Cardwell <ncardwell@google.com>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-10-21 07:00:53 -07:00
Yuchung Cheng
659a8ad56f tcp: track the packet timings in RACK
This patch is the first half of the RACK loss recovery.

RACK loss recovery uses the notion of time instead
of packet sequence (FACK) or counts (dupthresh). It's inspired by the
previous FACK heuristic in tcp_mark_lost_retrans(): when a limited
transmit (new data packet) is sacked, then current retransmitted
sequence below the newly sacked sequence must been lost,
since at least one round trip time has elapsed.

But it has several limitations:
1) can't detect tail drops since it depends on limited transmit
2) is disabled upon reordering (assumes no reordering)
3) only enabled in fast recovery ut not timeout recovery

RACK (Recently ACK) addresses these limitations with the notion
of time instead: a packet P1 is lost if a later packet P2 is s/acked,
as at least one round trip has passed.

Since RACK cares about the time sequence instead of the data sequence
of packets, it can detect tail drops when later retransmission is
s/acked while FACK or dupthresh can't. For reordering RACK uses a
dynamically adjusted reordering window ("reo_wnd") to reduce false
positives on ever (small) degree of reordering.

This patch implements tcp_advanced_rack() which tracks the
most recent transmission time among the packets that have been
delivered (ACKed or SACKed) in tp->rack.mstamp. This timestamp
is the key to determine which packet has been lost.

Consider an example that the sender sends six packets:
T1: P1 (lost)
T2: P2
T3: P3
T4: P4
T100: sack of P2. rack.mstamp = T2
T101: retransmit P1
T102: sack of P2,P3,P4. rack.mstamp = T4
T205: ACK of P4 since the hole is repaired. rack.mstamp = T101

We need to be careful about spurious retransmission because it may
falsely advance tp->rack.mstamp by an RTT or an RTO, causing RACK
to falsely mark all packets lost, just like a spurious timeout.

We identify spurious retransmission by the ACK's TS echo value.
If TS option is not applicable but the retransmission is acknowledged
less than min-RTT ago, it is likely to be spurious. We refrain from
using the transmission time of these spurious retransmissions.

The second half is implemented in the next patch that marks packet
lost using RACK timestamp.

Signed-off-by: Yuchung Cheng <ycheng@google.com>
Signed-off-by: Neal Cardwell <ncardwell@google.com>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-10-21 07:00:48 -07:00
Yuchung Cheng
77c631273d tcp: add tcp_tsopt_ecr_before helper
a helper to prepare the main RACK patch

Signed-off-by: Yuchung Cheng <ycheng@google.com>
Signed-off-by: Neal Cardwell <ncardwell@google.com>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-10-21 07:00:45 -07:00
Yuchung Cheng
af82f4e848 tcp: remove tcp_mark_lost_retrans()
Remove the existing lost retransmit detection because RACK subsumes
it completely. This also stops the overloading the ack_seq field of
the skb control block.

Signed-off-by: Yuchung Cheng <ycheng@google.com>
Signed-off-by: Neal Cardwell <ncardwell@google.com>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-10-21 07:00:44 -07:00
Yuchung Cheng
f672258391 tcp: track min RTT using windowed min-filter
Kathleen Nichols' algorithm for tracking the minimum RTT of a
data stream over some measurement window. It uses constant space
and constant time per update. Yet it almost always delivers
the same minimum as an implementation that has to keep all
the data in the window. The measurement window is tunable via
sysctl.net.ipv4.tcp_min_rtt_wlen with a default value of 5 minutes.

The algorithm keeps track of the best, 2nd best & 3rd best min
values, maintaining an invariant that the measurement time of
the n'th best >= n-1'th best. It also makes sure that the three
values are widely separated in the time window since that bounds
the worse case error when that data is monotonically increasing
over the window.

Upon getting a new min, we can forget everything earlier because
it has no value - the new min is less than everything else in the
window by definition and it's the most recent. So we restart fresh
on every new min and overwrites the 2nd & 3rd choices. The same
property holds for the 2nd & 3rd best.

Therefore we have to maintain two invariants to maximize the
information in the samples, one on values (1st.v <= 2nd.v <=
3rd.v) and the other on times (now-win <=1st.t <= 2nd.t <= 3rd.t <=
now). These invariants determine the structure of the code

The RTT input to the windowed filter is the minimum RTT measured
from ACK or SACK, or as the last resort from TCP timestamps.

The accessor tcp_min_rtt() returns the minimum RTT seen in the
window. ~0U indicates it is not available. The minimum is 1usec
even if the true RTT is below that.

Signed-off-by: Yuchung Cheng <ycheng@google.com>
Signed-off-by: Neal Cardwell <ncardwell@google.com>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-10-21 07:00:43 -07:00
Yuchung Cheng
9e45a3e36b tcp: apply Kern's check on RTTs used for congestion control
Currently ca_seq_rtt_us does not use Kern's check. Fix that by
checking if any packet acked is a retransmit, for both RTT used
for RTT estimation and congestion control.

Fixes: 5b08e47ca ("tcp: prefer packet timing to TS-ECR for RTT")
Signed-off-by: Yuchung Cheng <ycheng@google.com>
Signed-off-by: Neal Cardwell <ncardwell@google.com>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-10-21 07:00:41 -07:00
Eric Dumazet
dc6ef6be52 tcp: do not set queue_mapping on SYNACK
At the time of commit fff3269907 ("tcp: reflect SYN queue_mapping into
SYNACK packets") we had little ways to cope with SYN floods.

We no longer need to reflect incoming skb queue mappings, and instead
can pick a TX queue based on cpu cooking the SYNACK, with normal XPS
affinities.

Note that all SYNACK retransmits were picking TX queue 0, this no longer
is a win given that SYNACK rtx are now distributed on all cpus.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-10-18 22:26:02 -07:00
Eric Dumazet
ed53d0ab76 net: shrink struct sock and request_sock by 8 bytes
One 32bit hole is following skc_refcnt, use it.
skc_incoming_cpu can also be an union for request_sock rcv_wnd.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-10-12 19:28:22 -07:00
Eric Dumazet
a1a5344ddb tcp: avoid two atomic ops for syncookies
inet_reqsk_alloc() is used to allocate a temporary request
in order to generate a SYNACK with a cookie. Then later,
syncookie validation also uses a temporary request.

These paths already took a reference on listener refcount,
we can avoid a couple of atomic operations.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-10-05 02:45:27 -07:00
Eric Dumazet
7656d842de tcp: fix fastopen races vs lockless listener
There are multiple races that need fixes :

1) skb_get() + queue skb + kfree_skb() is racy

An accept() can be done on another cpu, data consumed immediately.
tcp_recvmsg() uses __kfree_skb() as it is assumed all skb found in
socket receive queue are private.

Then the kfree_skb() in tcp_rcv_state_process() uses an already freed skb

2) tcp_reqsk_record_syn() needs to be done before tcp_try_fastopen()
for the same reasons.

3) We want to send the SYNACK before queueing child into accept queue,
otherwise we might reintroduce the ooo issue fixed in
commit 7c85af8810 ("tcp: avoid reorders for TFO passive connections")

Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-10-05 02:45:24 -07:00
Eric Dumazet
ca6fb06518 tcp: attach SYNACK messages to request sockets instead of listener
If a listen backlog is very big (to avoid syncookies), then
the listener sk->sk_wmem_alloc is the main source of false
sharing, as we need to touch it twice per SYNACK re-transmit
and TX completion.

(One SYN packet takes listener lock once, but up to 6 SYNACK
are generated)

By attaching the skb to the request socket, we remove this
source of contention.

Tested:

 listen(fd, 10485760); // single listener (no SO_REUSEPORT)
 16 RX/TX queue NIC
 Sustain a SYNFLOOD attack of ~320,000 SYN per second,
 Sending ~1,400,000 SYNACK per second.
 Perf profiles now show listener spinlock being next bottleneck.

    20.29%  [kernel]  [k] queued_spin_lock_slowpath
    10.06%  [kernel]  [k] __inet_lookup_established
     5.12%  [kernel]  [k] reqsk_timer_handler
     3.22%  [kernel]  [k] get_next_timer_interrupt
     3.00%  [kernel]  [k] tcp_make_synack
     2.77%  [kernel]  [k] ipt_do_table
     2.70%  [kernel]  [k] run_timer_softirq
     2.50%  [kernel]  [k] ip_finish_output
     2.04%  [kernel]  [k] cascade

Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-10-03 04:32:43 -07:00
Eric Dumazet
079096f103 tcp/dccp: install syn_recv requests into ehash table
In this patch, we insert request sockets into TCP/DCCP
regular ehash table (where ESTABLISHED and TIMEWAIT sockets
are) instead of using the per listener hash table.

ACK packets find SYN_RECV pseudo sockets without having
to find and lock the listener.

In nominal conditions, this halves pressure on listener lock.

Note that this will allow for SO_REUSEPORT refinements,
so that we can select a listener using cpu/numa affinities instead
of the prior 'consistent hash', since only SYN packets will
apply this selection logic.

We will shrink listen_sock in the following patch to ease
code review.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Ying Cai <ycai@google.com>
Cc: Willem de Bruijn <willemb@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-10-03 04:32:41 -07:00
Eric Dumazet
8d2675f1e4 tcp: move synflood_warned into struct request_sock_queue
long term plan is to remove struct listen_sock when its hash
table is no longer there.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-10-03 04:32:37 -07:00