Commit graph

106 commits

Author SHA1 Message Date
Srinivasarao P
38cacfd106 Merge android-4.4.114 (fe09418) into msm-4.4
* refs/heads/tmp-fe09418
  Linux 4.4.114
  nfsd: auth: Fix gid sorting when rootsquash enabled
  net: tcp: close sock if net namespace is exiting
  flow_dissector: properly cap thoff field
  ipv4: Make neigh lookup keys for loopback/point-to-point devices be INADDR_ANY
  net: Allow neigh contructor functions ability to modify the primary_key
  vmxnet3: repair memory leak
  sctp: return error if the asoc has been peeled off in sctp_wait_for_sndbuf
  sctp: do not allow the v4 socket to bind a v4mapped v6 address
  r8169: fix memory corruption on retrieval of hardware statistics.
  pppoe: take ->needed_headroom of lower device into account on xmit
  net: qdisc_pkt_len_init() should be more robust
  tcp: __tcp_hdrlen() helper
  net: igmp: fix source address check for IGMPv3 reports
  lan78xx: Fix failure in USB Full Speed
  ipv6: ip6_make_skb() needs to clear cork.base.dst
  ipv6: fix udpv6 sendmsg crash caused by too small MTU
  ipv6: Fix getsockopt() for sockets with default IPV6_AUTOFLOWLABEL
  dccp: don't restart ccid2_hc_tx_rto_expire() if sk in closed state
  hrtimer: Reset hrtimer cpu base proper on CPU hotplug
  x86/microcode/intel: Extend BDW late-loading further with LLC size check
  eventpoll.h: add missing epoll event masks
  vsyscall: Fix permissions for emulate mode with KAISER/PTI
  um: link vmlinux with -no-pie
  usbip: prevent leaking socket pointer address in messages
  usbip: fix stub_rx: harden CMD_SUBMIT path to handle malicious input
  usbip: fix stub_rx: get_pipe() to validate endpoint number
  usb: usbip: Fix possible deadlocks reported by lockdep
  Input: trackpoint - force 3 buttons if 0 button is reported
  Revert "module: Add retpoline tag to VERMAGIC"
  scsi: libiscsi: fix shifting of DID_REQUEUE host byte
  fs/fcntl: f_setown, avoid undefined behaviour
  reiserfs: Don't clear SGID when inheriting ACLs
  reiserfs: don't preallocate blocks for extended attributes
  reiserfs: fix race in prealloc discard
  ext2: Don't clear SGID when inheriting ACLs
  netfilter: xt_osf: Add missing permission checks
  netfilter: nfnetlink_cthelper: Add missing permission checks
  netfilter: fix IS_ERR_VALUE usage
  netfilter: use fwmark_reflect in nf_send_reset
  netfilter: nf_conntrack_sip: extend request line validation
  netfilter: restart search if moved to other chain
  netfilter: nfnetlink_queue: reject verdict request from different portid
  netfilter: nf_ct_expect: remove the redundant slash when policy name is empty
  netfilter: nf_dup_ipv6: set again FLOWI_FLAG_KNOWN_NH at flowi6_flags
  netfilter: arp_tables: fix invoking 32bit "iptable -P INPUT ACCEPT" failed in 64bit kernel
  netfilter: x_tables: speed up jump target validation
  ACPICA: Namespace: fix operand cache leak
  ACPI / scan: Prefer devices without _HID/_CID for _ADR matching
  ACPI / processor: Avoid reserving IO regions too early
  x86/ioapic: Fix incorrect pointers in ioapic_setup_resources()
  ipc: msg, make msgrcv work with LONG_MIN
  mm, page_alloc: fix potential false positive in __zone_watermark_ok
  cma: fix calculation of aligned offset
  hwpoison, memcg: forcibly uncharge LRU pages
  mm/mmap.c: do not blow on PROT_NONE MAP_FIXED holes in the stack
  fs/select: add vmalloc fallback for select(2)
  mmc: sdhci-of-esdhc: add/remove some quirks according to vendor version
  PCI: layerscape: Fix MSG TLP drop setting
  PCI: layerscape: Add "fsl,ls2085a-pcie" compatible ID
  drivers: base: cacheinfo: fix boot error message when acpi is enabled
  drivers: base: cacheinfo: fix x86 with CONFIG_OF enabled
  Prevent timer value 0 for MWAITX
  timers: Plug locking race vs. timer migration
  time: Avoid undefined behaviour in ktime_add_safe()
  PM / sleep: declare __tracedata symbols as char[] rather than char
  can: af_can: canfd_rcv(): replace WARN_ONCE by pr_warn_once
  can: af_can: can_rcv(): replace WARN_ONCE by pr_warn_once
  sched/deadline: Use the revised wakeup rule for suspending constrained dl tasks
  x86/retpoline: Fill RSB on context switch for affected CPUs
  x86/cpu/intel: Introduce macros for Intel family numbers
  x86/microcode/intel: Fix BDW late-loading revision check
  usbip: Fix potential format overflow in userspace tools
  usbip: Fix implicit fallthrough warning
  usbip: prevent vhci_hcd driver from leaking a socket pointer address
  x86/asm/32: Make sync_core() handle missing CPUID on all 32-bit kernels
  ANDROID: sched: EAS: check energy_aware() before calling select_energy_cpu_brute() in up-migrate path
  UPSTREAM: eventpoll.h: add missing epoll event masks
  ANDROID: xattr: Pass EOPNOTSUPP to permission2

Conflicts:
	kernel/sched/fair.c

Change-Id: I15005cb3bc039f4361d25ed2e22f8175b3d7ca96
Signed-off-by: Srinivasarao P <spathi@codeaurora.org>
2018-02-01 14:02:45 +05:30
Dan Streetman
edaafa805e net: tcp: close sock if net namespace is exiting
[ Upstream commit 4ee806d51176ba7b8ff1efd81f271d7252e03a1d ]

When a tcp socket is closed, if it detects that its net namespace is
exiting, close immediately and do not wait for FIN sequence.

For normal sockets, a reference is taken to their net namespace, so it will
never exit while the socket is open.  However, kernel sockets do not take a
reference to their net namespace, so it may begin exiting while the kernel
socket is still open.  In this case if the kernel socket is a tcp socket,
it will stay open trying to complete its close sequence.  The sock's dst(s)
hold a reference to their interface, which are all transferred to the
namespace's loopback interface when the real interfaces are taken down.
When the namespace tries to take down its loopback interface, it hangs
waiting for all references to the loopback interface to release, which
results in messages like:

unregister_netdevice: waiting for lo to become free. Usage count = 1

These messages continue until the socket finally times out and closes.
Since the net namespace cleanup holds the net_mutex while calling its
registered pernet callbacks, any new net namespace initialization is
blocked until the current net namespace finishes exiting.

After this change, the tcp socket notices the exiting net namespace, and
closes immediately, releasing its dst(s) and their reference to the
loopback interface, which lets the net namespace continue exiting.

Link: https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1711407
Bugzilla: https://bugzilla.kernel.org/show_bug.cgi?id=97811
Signed-off-by: Dan Streetman <ddstreet@canonical.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2018-01-31 12:06:14 +01:00
Blagovest Kolenichev
901bf6ddcc Merge android-4.4@4b8fc9f (v4.4.82) into msm-4.4
* refs/heads/tmp-4b8fc9f
  UPSTREAM: locking: avoid passing around 'thread_info' in mutex debugging code
  ANDROID: arm64: fix undeclared 'init_thread_info' error
  UPSTREAM: kdb: use task_cpu() instead of task_thread_info()->cpu
  Linux 4.4.82
  net: account for current skb length when deciding about UFO
  ipv4: Should use consistent conditional judgement for ip fragment in __ip_append_data and ip_finish_output
  mm/mempool: avoid KASAN marking mempool poison checks as use-after-free
  KVM: arm/arm64: Handle hva aging while destroying the vm
  sparc64: Prevent perf from running during super critical sections
  udp: consistently apply ufo or fragmentation
  revert "ipv4: Should use consistent conditional judgement for ip fragment in __ip_append_data and ip_finish_output"
  revert "net: account for current skb length when deciding about UFO"
  packet: fix tp_reserve race in packet_set_ring
  net: avoid skb_warn_bad_offload false positives on UFO
  tcp: fastopen: tcp_connect() must refresh the route
  net: sched: set xt_tgchk_param par.nft_compat as 0 in ipt_init_target
  bpf, s390: fix jit branch offset related to ldimm64
  net: fix keepalive code vs TCP_FASTOPEN_CONNECT
  tcp: avoid setting cwnd to invalid ssthresh after cwnd reduction states
  ANDROID: keychord: Fix for a memory leak in keychord.
  ANDROID: keychord: Fix races in keychord_write.
  Use %zu to print resid (size_t).
  ANDROID: keychord: Fix a slab out-of-bounds read.
  Linux 4.4.81
  workqueue: implicit ordered attribute should be overridable
  net: account for current skb length when deciding about UFO
  ipv4: Should use consistent conditional judgement for ip fragment in __ip_append_data and ip_finish_output
  mm: don't dereference struct page fields of invalid pages
  signal: protect SIGNAL_UNKILLABLE from unintentional clearing.
  lib/Kconfig.debug: fix frv build failure
  mm, slab: make sure that KMALLOC_MAX_SIZE will fit into MAX_ORDER
  ARM: 8632/1: ftrace: fix syscall name matching
  virtio_blk: fix panic in initialization error path
  drm/virtio: fix framebuffer sparse warning
  scsi: qla2xxx: Get mutex lock before checking optrom_state
  phy state machine: failsafe leave invalid RUNNING state
  x86/boot: Add missing declaration of string functions
  tg3: Fix race condition in tg3_get_stats64().
  net: phy: dp83867: fix irq generation
  sh_eth: R8A7740 supports packet shecksumming
  wext: handle NULL extra data in iwe_stream_add_point better
  sparc64: Measure receiver forward progress to avoid send mondo timeout
  xen-netback: correctly schedule rate-limited queues
  net: phy: Fix PHY unbind crash
  net: phy: Correctly process PHY_HALTED in phy_stop_machine()
  net/mlx5: Fix command bad flow on command entry allocation failure
  sctp: fix the check for _sctp_walk_params and _sctp_walk_errors
  sctp: don't dereference ptr before leaving _sctp_walk_{params, errors}()
  dccp: fix a memleak for dccp_feat_init err process
  dccp: fix a memleak that dccp_ipv4 doesn't put reqsk properly
  dccp: fix a memleak that dccp_ipv6 doesn't put reqsk properly
  net: ethernet: nb8800: Handle all 4 RGMII modes identically
  ipv6: Don't increase IPSTATS_MIB_FRAGFAILS twice in ip6_fragment()
  packet: fix use-after-free in prb_retire_rx_blk_timer_expired()
  openvswitch: fix potential out of bound access in parse_ct
  mcs7780: Fix initialization when CONFIG_VMAP_STACK is enabled
  rtnetlink: allocate more memory for dev_set_mac_address()
  ipv4: initialize fib_trie prior to register_netdev_notifier call.
  ipv6: avoid overflow of offset in ip6_find_1stfragopt
  net: Zero terminate ifr_name in dev_ifname().
  ipv4: ipv6: initialize treq->txhash in cookie_v[46]_check()
  saa7164: fix double fetch PCIe access condition
  drm: rcar-du: fix backport bug
  f2fs: sanity check checkpoint segno and blkoff
  media: lirc: LIRC_GET_REC_RESOLUTION should return microseconds
  mm, mprotect: flush TLB if potentially racing with a parallel reclaim leaving stale TLB entries
  iser-target: Avoid isert_conn->cm_id dereference in isert_login_recv_done
  iscsi-target: Fix delayed logout processing greater than SECONDS_FOR_LOGOUT_COMP
  iscsi-target: Fix initial login PDU asynchronous socket close OOPs
  iscsi-target: Fix early sk_data_ready LOGIN_FLAGS_READY race
  iscsi-target: Always wait for kthread_should_stop() before kthread exit
  target: Avoid mappedlun symlink creation during lun shutdown
  media: platform: davinci: return -EINVAL for VPFE_CMD_S_CCDC_RAW_PARAMS ioctl
  ARM: dts: armada-38x: Fix irq type for pca955
  ext4: fix overflow caused by missing cast in ext4_resize_fs()
  ext4: fix SEEK_HOLE/SEEK_DATA for blocksize < pagesize
  mm/page_alloc: Remove kernel address exposure in free_reserved_area()
  KVM: async_pf: make rcu irq exit if not triggered from idle task
  ASoC: do not close shared backend dailink
  ALSA: hda - Fix speaker output from VAIO VPCL14M1R
  workqueue: restore WQ_UNBOUND/max_active==1 to be ordered
  libata: array underflow in ata_find_dev()
  ANDROID: binder: don't queue async transactions to thread.
  ANDROID: binder: don't enqueue death notifications to thread todo.
  ANDROID: binder: call poll_wait() unconditionally.
  android: configs: move quota-related configs to recommended
  BACKPORT: arm64: split thread_info from task stack
  UPSTREAM: arm64: assembler: introduce ldr_this_cpu
  UPSTREAM: arm64: make cpu number a percpu variable
  UPSTREAM: arm64: smp: prepare for smp_processor_id() rework
  BACKPORT: arm64: move sp_el0 and tpidr_el1 into cpu_suspend_ctx
  UPSTREAM: arm64: prep stack walkers for THREAD_INFO_IN_TASK
  UPSTREAM: arm64: unexport walk_stackframe
  UPSTREAM: arm64: traps: simplify die() and __die()
  UPSTREAM: arm64: factor out current_stack_pointer
  BACKPORT: arm64: asm-offsets: remove unused definitions
  UPSTREAM: arm64: thread_info remove stale items
  UPSTREAM: thread_info: include <current.h> for THREAD_INFO_IN_TASK
  UPSTREAM: thread_info: factor out restart_block
  UPSTREAM: kthread: Pin the stack via try_get_task_stack()/put_task_stack() in to_live_kthread() function
  UPSTREAM: sched/core: Add try_get_task_stack() and put_task_stack()
  UPSTREAM: sched/core: Allow putting thread_info into task_struct
  UPSTREAM: printk: when dumping regs, show the stack, not thread_info
  UPSTREAM: fix up initial thread stack pointer vs thread_info confusion
  UPSTREAM: Clarify naming of thread info/stack allocators
  ANDROID: sdcardfs: override credential for ioctl to lower fs

Conflicts:
	android/configs/android-base.cfg
	arch/arm64/Kconfig
	arch/arm64/include/asm/suspend.h
	arch/arm64/kernel/head.S
	arch/arm64/kernel/smp.c
	arch/arm64/kernel/suspend.c
	arch/arm64/kernel/traps.c
	arch/arm64/mm/proc.S
	kernel/fork.c
	sound/soc/soc-pcm.c

Change-Id: I273e216c94899a838bbd208391c6cbe20b2bf683
Signed-off-by: Blagovest Kolenichev <bkolenichev@codeaurora.org>
2017-09-01 11:47:49 -07:00
Eric Dumazet
4e0675f44b net: fix keepalive code vs TCP_FASTOPEN_CONNECT
[ Upstream commit 2dda640040876cd8ae646408b69eea40c24f9ae9 ]

syzkaller was able to trigger a divide by 0 in TCP stack [1]

Issue here is that keepalive timer needs to be updated to not attempt
to send a probe if the connection setup was deferred using
TCP_FASTOPEN_CONNECT socket option added in linux-4.11

[1]
 divide error: 0000 [#1] SMP
 CPU: 18 PID: 0 Comm: swapper/18 Not tainted
 task: ffff986f62f4b040 ti: ffff986f62fa2000 task.ti: ffff986f62fa2000
 RIP: 0010:[<ffffffff8409cc0d>]  [<ffffffff8409cc0d>] __tcp_select_window+0x8d/0x160
 Call Trace:
  <IRQ>
  [<ffffffff8409d951>] tcp_transmit_skb+0x11/0x20
  [<ffffffff8409da21>] tcp_xmit_probe_skb+0xc1/0xe0
  [<ffffffff840a0ee8>] tcp_write_wakeup+0x68/0x160
  [<ffffffff840a151b>] tcp_keepalive_timer+0x17b/0x230
  [<ffffffff83b3f799>] call_timer_fn+0x39/0xf0
  [<ffffffff83b40797>] run_timer_softirq+0x1d7/0x280
  [<ffffffff83a04ddb>] __do_softirq+0xcb/0x257
  [<ffffffff83ae03ac>] irq_exit+0x9c/0xb0
  [<ffffffff83a04c1a>] smp_apic_timer_interrupt+0x6a/0x80
  [<ffffffff83a03eaf>] apic_timer_interrupt+0x7f/0x90
  <EOI>
  [<ffffffff83fed2ea>] ? cpuidle_enter_state+0x13a/0x3b0
  [<ffffffff83fed2cd>] ? cpuidle_enter_state+0x11d/0x3b0

Tested:

Following packetdrill no longer crashes the kernel

`echo 0 >/proc/sys/net/ipv4/tcp_timestamps`

// Cache warmup: send a Fast Open cookie request
    0 socket(..., SOCK_STREAM, IPPROTO_TCP) = 3
   +0 fcntl(3, F_SETFL, O_RDWR|O_NONBLOCK) = 0
   +0 setsockopt(3, SOL_TCP, TCP_FASTOPEN_CONNECT, [1], 4) = 0
   +0 connect(3, ..., ...) = -1 EINPROGRESS (Operation is now in progress)
   +0 > S 0:0(0) <mss 1460,nop,nop,sackOK,nop,wscale 8,FO,nop,nop>
 +.01 < S. 123:123(0) ack 1 win 14600 <mss 1460,nop,nop,sackOK,nop,wscale 6,FO abcd1234,nop,nop>
   +0 > . 1:1(0) ack 1
   +0 close(3) = 0
   +0 > F. 1:1(0) ack 1
   +0 < F. 1:1(0) ack 2 win 92
   +0 > .  2:2(0) ack 2

   +0 socket(..., SOCK_STREAM, IPPROTO_TCP) = 4
   +0 fcntl(4, F_SETFL, O_RDWR|O_NONBLOCK) = 0
   +0 setsockopt(4, SOL_TCP, TCP_FASTOPEN_CONNECT, [1], 4) = 0
   +0 setsockopt(4, SOL_SOCKET, SO_KEEPALIVE, [1], 4) = 0
 +.01 connect(4, ..., ...) = 0
   +0 setsockopt(4, SOL_TCP, TCP_KEEPIDLE, [5], 4) = 0
   +10 close(4) = 0

`echo 1 >/proc/sys/net/ipv4/tcp_timestamps`

Fixes: 19f6d3f3c842 ("net/tcp-fastopen: Add new API support")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Reported-by: Dmitry Vyukov <dvyukov@google.com>
Cc: Wei Wang <weiwan@google.com>
Cc: Yuchung Cheng <ycheng@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2017-08-12 19:29:08 -07:00
Blagovest Kolenichev
b47135257c Merge branch 'android-4.4@c71ad0f' into branch 'msm-4.4'
* refs/heads/tmp-c71ad0f:
  BACKPORT: arm64: dts: juno: fix cluster sleep state entry latency on all SoC versions
  staging: android: ashmem: lseek failed due to no FMODE_LSEEK.
  ANDROID: sdcardfs: update module info
  ANDROID: sdcardfs: use d_splice_alias
  ANDROID: sdcardfs: add read_iter/write_iter opeations
  ANDROID: sdcardfs: fix ->llseek to update upper and lower offset
  ANDROID: sdcardfs: copy lower inode attributes in ->ioctl
  ANDROID: sdcardfs: remove unnecessary call to do_munmap
  Merge 4.4.59 into android-4.4
  UPSTREAM: ipv6 addrconf: implement RFC7559 router solicitation backoff
  android: base-cfg: enable CONFIG_INET_DIAG_DESTROY
  ANDROID: android-base.cfg: add CONFIG_MODULES option
  ANDROID: android-base.cfg: add CONFIG_IKCONFIG option
  ANDROID: android-base.cfg: properly sort the file
  ANDROID: binder: add hwbinder,vndbinder to BINDER_DEVICES.
  ANDROID: sort android-recommended.cfg
  UPSTREAM: config/android: Remove CONFIG_IPV6_PRIVACY
  UPSTREAM: config: android: set SELinux as default security mode
  config: android: move device mapper options to recommended
  ANDROID: ARM64: Allow to choose appended kernel image
  UPSTREAM: arm64: vdso: constify vm_special_mapping used for aarch32 vectors page
  UPSTREAM: arm64: vdso: add __init section marker to alloc_vectors_page
  UPSTREAM: ARM: 8597/1: VDSO: put RO and RO after init objects into proper sections
  UPSTREAM: arm64: Add support for CLOCK_MONOTONIC_RAW in clock_gettime() vDSO
  UPSTREAM: arm64: Refactor vDSO time functions
  UPSTREAM: arm64: fix vdso-offsets.h dependency
  UPSTREAM: kbuild: drop FORCE from PHONY targets
  UPSTREAM: mm: add PHYS_PFN, use it in __phys_to_pfn()
  UPSTREAM: ARM: 8476/1: VDSO: use PTR_ERR_OR_ZERO for vma check
  Linux 4.4.58
  crypto: algif_hash - avoid zero-sized array
  fbcon: Fix vc attr at deinit
  serial: 8250_pci: Detach low-level driver during PCI error recovery
  ACPI / blacklist: Make Dell Latitude 3350 ethernet work
  ACPI / blacklist: add _REV quirks for Dell Precision 5520 and 3520
  uvcvideo: uvc_scan_fallback() for webcams with broken chain
  s390/zcrypt: Introduce CEX6 toleration
  block: allow WRITE_SAME commands with the SG_IO ioctl
  vfio/spapr: Postpone allocation of userspace version of TCE table
  PCI: Do any VF BAR updates before enabling the BARs
  PCI: Ignore BAR updates on virtual functions
  PCI: Update BARs using property bits appropriate for type
  PCI: Don't update VF BARs while VF memory space is enabled
  PCI: Decouple IORESOURCE_ROM_ENABLE and PCI_ROM_ADDRESS_ENABLE
  PCI: Add comments about ROM BAR updating
  PCI: Remove pci_resource_bar() and pci_iov_resource_bar()
  PCI: Separate VF BAR updates from standard BAR updates
  x86/hyperv: Handle unknown NMIs on one CPU when unknown_nmi_panic
  igb: add i211 to i210 PHY workaround
  igb: Workaround for igb i210 firmware issue
  xen: do not re-use pirq number cached in pci device msi msg data
  xfs: clear _XBF_PAGES from buffers when readahead page
  USB: usbtmc: add missing endpoint sanity check
  nl80211: fix dumpit error path RTNL deadlocks
  xfs: fix up xfs_swap_extent_forks inline extent handling
  xfs: don't allow di_size with high bit set
  libceph: don't set weight to IN when OSD is destroyed
  raid10: increment write counter after bio is split
  cpufreq: Restore policy min/max limits on CPU online
  ARM: dts: at91: sama5d2: add dma properties to UART nodes
  ARM: at91: pm: cpu_idle: switch DDR to power-down mode
  iommu/vt-d: Fix NULL pointer dereference in device_to_iommu
  xen/acpi: upload PM state from init-domain to Xen
  mmc: sdhci: Do not disable interrupts while waiting for clock
  ext4: mark inode dirty after converting inline directory
  parport: fix attempt to write duplicate procfiles
  iio: hid-sensor-trigger: Change get poll value function order to avoid sensor properties losing after resume from S3
  iio: adc: ti_am335x_adc: fix fifo overrun recovery
  mmc: ushc: fix NULL-deref at probe
  uwb: hwa-rc: fix NULL-deref at probe
  uwb: i1480-dfu: fix NULL-deref at probe
  usb: hub: Fix crash after failure to read BOS descriptor
  usb: musb: cppi41: don't check early-TX-interrupt for Isoch transfer
  USB: wusbcore: fix NULL-deref at probe
  USB: idmouse: fix NULL-deref at probe
  USB: lvtest: fix NULL-deref at probe
  USB: uss720: fix NULL-deref at probe
  usb-core: Add LINEAR_FRAME_INTR_BINTERVAL USB quirk
  usb: gadget: f_uvc: Fix SuperSpeed companion descriptor's wBytesPerInterval
  ACM gadget: fix endianness in notifications
  USB: serial: qcserial: add Dell DW5811e
  USB: serial: option: add Quectel UC15, UC20, EC21, and EC25 modems
  ALSA: hda - Adding a group of pin definition to fix headset problem
  ALSA: ctxfi: Fix the incorrect check of dma_set_mask() call
  ALSA: seq: Fix racy cell insertions during snd_seq_pool_done()
  Input: sur40 - validate number of endpoints before using them
  Input: kbtab - validate number of endpoints before using them
  Input: cm109 - validate number of endpoints before using them
  Input: yealink - validate number of endpoints before using them
  Input: hanwang - validate number of endpoints before using them
  Input: ims-pcu - validate number of endpoints before using them
  Input: iforce - validate number of endpoints before using them
  Input: i8042 - add noloop quirk for Dell Embedded Box PC 3000
  Input: elan_i2c - add ASUS EeeBook X205TA special touchpad fw
  tcp: initialize icsk_ack.lrcvtime at session start time
  socket, bpf: fix sk_filter use after free in sk_clone_lock
  ipv4: provide stronger user input validation in nl_fib_input()
  net: bcmgenet: remove bcmgenet_internal_phy_setup()
  net/mlx5e: Count LRO packets correctly
  net/mlx5: Increase number of max QPs in default profile
  net: unix: properly re-increment inflight counter of GC discarded candidates
  amd-xgbe: Fix jumbo MTU processing on newer hardware
  net: properly release sk_frag.page
  net: bcmgenet: Do not suspend PHY if Wake-on-LAN is enabled
  net/openvswitch: Set the ipv6 source tunnel key address attribute correctly
  Linux 4.4.57
  ext4: fix fencepost in s_first_meta_bg validation
  percpu: acquire pcpu_lock when updating pcpu_nr_empty_pop_pages
  gfs2: Avoid alignment hole in struct lm_lockname
  isdn/gigaset: fix NULL-deref at probe
  target: Fix VERIFY_16 handling in sbc_parse_cdb
  scsi: libiscsi: add lock around task lists to fix list corruption regression
  scsi: lpfc: Add shutdown method for kexec
  target/pscsi: Fix TYPE_TAPE + TYPE_MEDIMUM_CHANGER export
  md/raid1/10: fix potential deadlock
  powerpc/boot: Fix zImage TOC alignment
  cpufreq: Fix and clean up show_cpuinfo_cur_freq()
  perf/core: Fix event inheritance on fork()
  give up on gcc ilog2() constant optimizations
  kernek/fork.c: allocate idle task for a CPU always on its local node
  hv_netvsc: use skb_get_hash() instead of a homegrown implementation
  tpm_tis: Use devm_free_irq not free_irq
  drm/amdgpu: add missing irq.h include
  s390/pci: fix use after free in dma_init
  KVM: PPC: Book3S PR: Fix illegal opcode emulation
  xen/qspinlock: Don't kick CPU if IRQ is not initialized
  Drivers: hv: avoid vfree() on crash
  Drivers: hv: balloon: don't crash when memory is added in non-sorted order
  pinctrl: cherryview: Do not mask all interrupts in probe
  ACPI / video: skip evaluating _DOD when it does not exist
  cxlflash: Increase cmd_per_lun for better throughput
  crypto: mcryptd - Fix load failure
  crypto: cryptd - Assign statesize properly
  crypto: ghash-clmulni - Fix load failure
  USB: don't free bandwidth_mutex too early
  usb: core: hub: hub_port_init lock controller instead of bus
  ANDROID: sdcardfs: Fix style issues in macros
  ANDROID: sdcardfs: Use seq_puts over seq_printf
  ANDROID: sdcardfs: Use to kstrout
  ANDROID: sdcardfs: Use pr_[...] instead of printk
  ANDROID: sdcardfs: remove unneeded null check
  ANDROID: sdcardfs: Fix style issues with comments
  ANDROID: sdcardfs: Fix formatting
  ANDROID: sdcardfs: correct order of descriptors
  fix the deadlock in xt_qtaguid when enable DDEBUG
  net: ipv6: Add sysctl for minimum prefix len acceptable in RIOs.
  Linux 4.4.56
  futex: Add missing error handling to FUTEX_REQUEUE_PI
  futex: Fix potential use-after-free in FUTEX_REQUEUE_PI
  x86/perf: Fix CR4.PCE propagation to use active_mm instead of mm
  x86/kasan: Fix boot with KASAN=y and PROFILE_ANNOTATED_BRANCHES=y
  fscrypto: lock inode while setting encryption policy
  fscrypt: fix renaming and linking special files
  net sched actions: decrement module reference count after table flush.
  dccp: fix memory leak during tear-down of unsuccessful connection request
  dccp/tcp: fix routing redirect race
  bridge: drop netfilter fake rtable unconditionally
  ipv6: avoid write to a possibly cloned skb
  ipv6: make ECMP route replacement less greedy
  mpls: Send route delete notifications when router module is unloaded
  act_connmark: avoid crashing on malformed nlattrs with null parms
  uapi: fix linux/packet_diag.h userspace compilation error
  vrf: Fix use-after-free in vrf_xmit
  dccp: fix use-after-free in dccp_feat_activate_values
  net: fix socket refcounting in skb_complete_tx_timestamp()
  net: fix socket refcounting in skb_complete_wifi_ack()
  tcp: fix various issues for sockets morphing to listen state
  dccp: Unlock sock before calling sk_free()
  net: net_enable_timestamp() can be called from irq contexts
  net: don't call strlen() on the user buffer in packet_bind_spkt()
  l2tp: avoid use-after-free caused by l2tp_ip_backlog_recv
  ipv4: mask tos for input route
  vti6: return GRE_KEY for vti6
  vxlan: correctly validate VXLAN ID against VXLAN_N_VID
  netlink: remove mmapped netlink support
  ANDROID: mmc: core: export emmc revision
  BACKPORT: mmc: core: Export device lifetime information through sysfs
  ANDROID: android-verity: do not compile as independent module
  ANDROID: sched: fix duplicate sched_group_energy const specifiers
  config: disable CONFIG_USELIB and CONFIG_FHANDLE
  ANDROID: power: align wakeup_sources format
  ANDROID: dm: android-verity: allow disable dm-verity for Treble VTS
  uid_sys_stats: change to use rt_mutex
  ANDROID: vfs: user permission2 in notify_change2
  ANDROID: sdcardfs: Fix gid issue
  ANDROID: sdcardfs: Use tabs instead of spaces in multiuser.h
  ANDROID: sdcardfs: Remove uninformative prints
  ANDROID: sdcardfs: move path_put outside of spinlock
  ANDROID: sdcardfs: Use case insensitive hash function
  ANDROID: sdcardfs: declare MODULE_ALIAS_FS
  ANDROID: sdcardfs: Get the blocksize from the lower fs
  ANDROID: sdcardfs: Use d_invalidate instead of drop_recurisve
  ANDROID: sdcardfs: Switch to internal case insensitive compare
  ANDROID: sdcardfs: Use spin_lock_nested
  ANDROID: sdcardfs: Replace get/put with d_lock
  ANDROID: sdcardfs: rate limit warning print
  ANDROID: sdcardfs: Fix case insensitive lookup
  ANDROID: uid_sys_stats: account for fsync syscalls
  ANDROID: sched: add a counter to track fsync
  ANDROID: uid_sys_stats: fix negative write bytes.
  ANDROID: uid_sys_stats: allow writing same state
  ANDROID: uid_sys_stats: rename uid_cputime.c to uid_sys_stats.c
  ANDROID: uid_cputime: add per-uid IO usage accounting
  DTB: Add EAS compatible Juno Energy model to 'juno.dts'
  arm64: dts: juno: Add idle-states to device tree
  ANDROID: Replace spaces by '_' for some android filesystem tracepoints.
  usb: gadget: f_accessory: Fix for UsbAccessory clean unbind.
  android: binder: move global binder state into context struct.
  android: binder: add padding to binder_fd_array_object.
  binder: use group leader instead of open thread
  nf: IDLETIMER: Use fullsock when querying uid
  nf: IDLETIMER: Fix use after free condition during work
  ANDROID: dm: android-verity: fix table_make_digest() error handling
  ANDROID: usb: gadget: function: Fix commenting style
  cpufreq: interactive governor drops bits in time calculation
  ANDROID: sdcardfs: support direct-IO (DIO) operations
  ANDROID: sdcardfs: implement vm_ops->page_mkwrite
  ANDROID: sdcardfs: Don't bother deleting freelist
  ANDROID: sdcardfs: Add missing path_put
  ANDROID: sdcardfs: Fix incorrect hash
  ANDROID: ext4 crypto: Disables zeroing on truncation when there's no key
  ANDROID: ext4: add a non-reversible key derivation method
  ANDROID: ext4: allow encrypting filenames using HEH algorithm
  ANDROID: arm64/crypto: add ARMv8-CE optimized poly_hash algorithm
  ANDROID: crypto: heh - factor out poly_hash algorithm
  ANDROID: crypto: heh - Add Hash-Encrypt-Hash (HEH) algorithm
  ANDROID: crypto: gf128mul - Add ble multiplication functions
  ANDROID: crypto: gf128mul - Refactor gf128 overflow macros and tables
  UPSTREAM: crypto: gf128mul - Zero memory when freeing multiplication table
  ANDROID: crypto: shash - Add crypto_grab_shash() and crypto_spawn_shash_alg()
  ANDROID: crypto: allow blkcipher walks over ablkcipher data
  UPSTREAM: arm/arm64: crypto: assure that ECB modes don't require an IV
  ANDROID: Refactor fs readpage/write tracepoints.
  ANDROID: export security_path_chown
  Squashfs: optimize reading uncompressed data
  Squashfs: implement .readpages()
  Squashfs: replace buffer_head with BIO
  Squashfs: refactor page_actor
  Squashfs: remove the FILE_CACHE option
  ANDROID: android-recommended.cfg: CONFIG_CPU_SW_DOMAIN_PAN=y
  FROMLIST: 9p: fix a potential acl leak
  BACKPORT: posix_acl: Clear SGID bit when setting file permissions
  UPSTREAM: udp: properly support MSG_PEEK with truncated buffers
  UPSTREAM: arm64: Allow hw watchpoint of length 3,5,6 and 7
  BACKPORT: arm64: hw_breakpoint: Handle inexact watchpoint addresses
  UPSTREAM: arm64: Allow hw watchpoint at varied offset from base address
  BACKPORT: hw_breakpoint: Allow watchpoint of length 3,5,6 and 7
  ANDROID: sdcardfs: Switch strcasecmp for internal call
  ANDROID: sdcardfs: switch to full_name_hash and qstr
  ANDROID: sdcardfs: Add GID Derivation to sdcardfs
  ANDROID: sdcardfs: Remove redundant operation
  ANDROID: sdcardfs: add support for user permission isolation
  ANDROID: sdcardfs: Refactor configfs interface
  ANDROID: sdcardfs: Allow non-owners to touch
  ANDROID: binder: fix format specifier for type binder_size_t
  ANDROID: fs: Export vfs_rmdir2
  ANDROID: fs: Export free_fs_struct and set_fs_pwd
  BACKPORT: Input: xpad - validate USB endpoint count during probe
  BACKPORT: Input: xpad - fix oops when attaching an unknown Xbox One gamepad
  ANDROID: mnt: remount should propagate to slaves of slaves
  ANDROID: sdcardfs: Switch ->d_inode to d_inode()
  ANDROID: sdcardfs: Fix locking issue with permision fix up
  ANDROID: sdcardfs: Change magic value
  ANDROID: sdcardfs: Use per mount permissions
  ANDROID: sdcardfs: Add gid and mask to private mount data
  ANDROID: sdcardfs: User new permission2 functions
  ANDROID: vfs: Add setattr2 for filesystems with per mount permissions
  ANDROID: vfs: Add permission2 for filesystems with per mount permissions
  ANDROID: vfs: Allow filesystems to access their private mount data
  ANDROID: mnt: Add filesystem private data to mount points
  ANDROID: sdcardfs: Move directory unlock before touch
  ANDROID: sdcardfs: fix external storage exporting incorrect uid
  ANDROID: sdcardfs: Added top to sdcardfs_inode_info
  ANDROID: sdcardfs: Switch package list to RCU
  ANDROID: sdcardfs: Fix locking for permission fix up
  ANDROID: sdcardfs: Check for other cases on path lookup
  ANDROID: sdcardfs: override umask on mkdir and create
  arm64: kernel: Fix build warning
  DEBUG: sched/fair: Fix sched_load_avg_cpu events for task_groups
  DEBUG: sched/fair: Fix missing sched_load_avg_cpu events
  UPSTREAM: l2tp: fix racy SOCK_ZAPPED flag check in l2tp_ip{,6}_bind()
  UPSTREAM: packet: fix race condition in packet_set_ring
  UPSTREAM: netlink: Fix dump skb leak/double free
  UPSTREAM: net: avoid signed overflows for SO_{SND|RCV}BUFFORCE
  MIPS: Prevent "restoration" of MSA context in non-MSA kernels
  net: socket: don't set sk_uid to garbage value in ->setattr()
  ANDROID: configs: CONFIG_ARM64_SW_TTBR0_PAN=y
  UPSTREAM: arm64: Disable PAN on uaccess_enable()
  UPSTREAM: arm64: Enable CONFIG_ARM64_SW_TTBR0_PAN
  UPSTREAM: arm64: xen: Enable user access before a privcmd hvc call
  UPSTREAM: arm64: Handle faults caused by inadvertent user access with PAN enabled
  BACKPORT: arm64: Disable TTBR0_EL1 during normal kernel execution
  BACKPORT: arm64: Introduce uaccess_{disable,enable} functionality based on TTBR0_EL1
  BACKPORT: arm64: Factor out TTBR0_EL1 post-update workaround into a specific asm macro
  BACKPORT: arm64: Factor out PAN enabling/disabling into separate uaccess_* macros
  UPSTREAM: arm64: alternative: add auto-nop infrastructure
  UPSTREAM: arm64: barriers: introduce nops and __nops macros for NOP sequences
  Revert "FROMLIST: arm64: Factor out PAN enabling/disabling into separate uaccess_* macros"
  Revert "FROMLIST: arm64: Factor out TTBR0_EL1 post-update workaround into a specific asm macro"
  Revert "FROMLIST: arm64: Introduce uaccess_{disable,enable} functionality based on TTBR0_EL1"
  Revert "FROMLIST: arm64: Disable TTBR0_EL1 during normal kernel execution"
  Revert "FROMLIST: arm64: Handle faults caused by inadvertent user access with PAN enabled"
  Revert "FROMLIST: arm64: xen: Enable user access before a privcmd hvc call"
  Revert "FROMLIST: arm64: Enable CONFIG_ARM64_SW_TTBR0_PAN"
  ANDROID: sched/walt: fix build failure if FAIR_GROUP_SCHED=n
  ANDROID: trace: net: use %pK for kernel pointers
  ANDROID: android-base: Enable QUOTA related configs
  net: ipv4: Don't crash if passing a null sk to ip_rt_update_pmtu.
  net: inet: Support UID-based routing in IP protocols.
  net: core: add UID to flows, rules, and routes
  net: core: Add a UID field to struct sock.
  Revert "net: core: Support UID-based routing."
  UPSTREAM: efi/arm64: Don't apply MEMBLOCK_NOMAP to UEFI memory map mapping
  UPSTREAM: arm64: mm: always take dirty state from new pte in ptep_set_access_flags
  UPSTREAM: arm64: Implement pmdp_set_access_flags() for hardware AF/DBM
  UPSTREAM: arm64: Fix typo in the pmdp_huge_get_and_clear() definition
  UPSTREAM: arm64: enable CONFIG_DEBUG_RODATA by default
  goldfish: enable CONFIG_INET_DIAG_DESTROY
  sched/walt: kill {min,max}_capacity
  sched: fix wrong truncation of walt_avg
  build: fix build config kernel_dir
  ANDROID: dm verity: add minimum prefetch size
  build: add build server configs for goldfish
  usb: gadget: Fix compilation problem with tx_qlen field

Conflicts:
	android/configs/android-base.cfg
	arch/arm64/Makefile
	arch/arm64/include/asm/cpufeature.h
	arch/arm64/kernel/vdso/gettimeofday.S
	arch/arm64/mm/cache.S
	drivers/md/Kconfig
	drivers/misc/Makefile
	drivers/mmc/host/sdhci.c
	drivers/usb/core/hcd.c
	drivers/usb/gadget/function/u_ether.c
	fs/sdcardfs/derived_perm.c
	fs/sdcardfs/file.c
	fs/sdcardfs/inode.c
	fs/sdcardfs/lookup.c
	fs/sdcardfs/main.c
	fs/sdcardfs/multiuser.h
	fs/sdcardfs/packagelist.c
	fs/sdcardfs/sdcardfs.h
	fs/sdcardfs/super.c
	include/linux/mmc/card.h
	include/linux/mmc/mmc.h
	include/trace/events/android_fs.h
	include/trace/events/android_fs_template.h
	drivers/android/binder.c
	fs/exec.c
	fs/ext4/crypto_key.c
	fs/ext4/ext4.h
	fs/ext4/inline.c
	fs/ext4/inode.c
	fs/ext4/readpage.c
	fs/f2fs/data.c
	fs/f2fs/inline.c
	fs/mpage.c
	include/linux/dcache.h
	include/trace/events/sched.h
	include/uapi/linux/ipv6.h
	net/ipv4/tcp_ipv4.c
	net/netfilter/xt_IDLETIMER.c

Change-Id: Ie345db6a14869fe0aa794aef4b71b5d0d503690b
Signed-off-by: Blagovest Kolenichev <bkolenichev@codeaurora.org>
2017-04-20 15:19:15 -07:00
Eric Dumazet
2681a7853a tcp: fix various issues for sockets morphing to listen state
[ Upstream commit 02b2faaf0af1d85585f6d6980e286d53612acfc2 ]

Dmitry Vyukov reported a divide by 0 triggered by syzkaller, exploiting
tcp_disconnect() path that was never really considered and/or used
before syzkaller ;)

I was not able to reproduce the bug, but it seems issues here are the
three possible actions that assumed they would never trigger on a
listener.

1) tcp_write_timer_handler
2) tcp_delack_timer_handler
3) MTU reduction

Only IPv6 MTU reduction was properly testing TCP_CLOSE and TCP_LISTEN
 states from tcp_v6_mtu_reduced()

Signed-off-by: Eric Dumazet <edumazet@google.com>
Reported-by: Dmitry Vyukov <dvyukov@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2017-03-22 12:04:15 +01:00
Ravi Joshi
15ada4d01b WLAN subsystem: Sysctl support for key TCP/IP parameters
It has been observed that default values for some of key tcp/ip
parameters are affecting the tput/performance of the system. Hence
extending configuration capabilities to TCP/Ip stack through
sysctl interface

Change-Id: I4287e9103769535f43e0934bac08435a524ee6a4
CRs-Fixed: 507581
Signed-off-by: Ravi Joshi <ravij@codeaurora.org>
Signed-off-by: Ganesh Babu Kumaravel <kganesh@codeaurora.org>
Signed-off-by: Mohit Khanna <mkhannaqca@codeaurora.org>
2016-07-15 13:35:09 -07:00
Yuchung Cheng
dd52bc2b4e tcp: fix Fast Open snmp over-counting bug
Fix incrementing TCPFastOpenActiveFailed snmp stats multiple times
when the handshake experiences multiple SYN timeouts.

Signed-off-by: Yuchung Cheng <ycheng@google.com>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-11-20 10:51:12 -05:00
Yuchung Cheng
0e45f4da59 tcp: disable Fast Open on timeouts after handshake
Some middle-boxes black-hole the data after the Fast Open handshake
(https://www.ietf.org/proceedings/94/slides/slides-94-tcpm-13.pdf).
The exact reason is unknown. The work-around is to disable Fast Open
temporarily after multiple recurring timeouts with few or no data
delivered in the established state.

Signed-off-by: Yuchung Cheng <ycheng@google.com>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Reported-by: Christoph Paasch <cpaasch@apple.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-11-20 10:51:12 -05:00
Richard Sailer
7533ce3055 tcp: change type of alive from int to bool
The alive parameter of tcp_orphan_retries, indicates
whether the connection is assumed alive or not.
In the function and all places calling it is used as a boolean value.

Therefore this changes the type of alive to bool in the function
definition and all calling locations.

Since tcp_orphan_tries is a tcp_timer.c local function no change in
any other file or header is necessary.

Signed-off-by: Richard Sailer <richard@weltraumpflege.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-10-12 05:15:03 -07:00
Eric Dumazet
a4e2405cc5 tcp: do not export tcp_init_xmit_timers()
After commit 900f65d361 ("tcp: move duplicate code from
tcp_v4_init_sock()/tcp_v6_init_sock()"), we no longer
need to export tcp_init_xmit_timers()

Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Neal Cardwell <ncardwell@google.com>
Acked-by: Neal Cardwell <ncardwell@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-07-09 21:44:38 -07:00
Eric Dumazet
b8da51ebb1 tcp: introduce tcp_under_memory_pressure()
Introduce an optimized version of sk_under_memory_pressure()
for TCP. Our intent is to use it in fast paths.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-05-17 22:45:48 -04:00
Eric Dumazet
e520af48c7 tcp: add TCPWinProbe and TCPKeepAlive SNMP counters
Diagnosing problems related to Window Probes has been hard because
we lack a counter.

TCPWinProbe counts the number of ACK packets a sender has to send
at regular intervals to make sure a reverse ACK packet opening back
a window had not been lost.

TCPKeepAlive counts the number of ACK packets sent to keep TCP
flows alive (SO_KEEPALIVE)

Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: Yuchung Cheng <ycheng@google.com>
Acked-by: Neal Cardwell <ncardwell@google.com>
Acked-by: Nandita Dukkipati <nanditad@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-05-09 16:42:32 -04:00
Daniel Lee
2646c831c0 tcp: RFC7413 option support for Fast Open client
Fast Open has been using an experimental option with a magic number
(RFC6994). This patch makes the client by default use the RFC7413
option (34) to get and send Fast Open cookies.  This patch makes
the client solicit cookies from a given server first with the
RFC7413 option. If that fails to elicit a cookie, then it tries
the RFC6994 experimental option. If that also fails, it uses the
RFC7413 option on all subsequent connect attempts.  If the server
returns a Fast Open cookie then the client caches the form of the
option that successfully elicited a cookie, and uses that form on
later connects when it presents that cookie.

The idea is to gradually obsolete the use of experimental options as
the servers and clients upgrade, while keeping the interoperability
meanwhile.

Signed-off-by: Daniel Lee <Longinus00@gmail.com>
Signed-off-by: Yuchung Cheng <ycheng@google.com>
Signed-off-by: Neal Cardwell <ncardwell@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-04-07 18:36:39 -04:00
Eric Dumazet
42cb80a235 inet: remove sk_listener parameter from syn_ack_timeout()
It is not needed, and req->sk_listener points to the listener anyway.
request_sock argument can be const.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-03-23 16:52:25 -04:00
Eric Dumazet
fa76ce7328 inet: get rid of central tcp/dccp listener timer
One of the major issue for TCP is the SYNACK rtx handling,
done by inet_csk_reqsk_queue_prune(), fired by the keepalive
timer of a TCP_LISTEN socket.

This function runs for awful long times, with socket lock held,
meaning that other cpus needing this lock have to spin for hundred of ms.

SYNACK are sent in huge bursts, likely to cause severe drops anyway.

This model was OK 15 years ago when memory was very tight.

We now can afford to have a timer per request sock.

Timer invocations no longer need to lock the listener,
and can be run from all cpus in parallel.

With following patch increasing somaxconn width to 32 bits,
I tested a listener with more than 4 million active request sockets,
and a steady SYNFLOOD of ~200,000 SYN per second.
Host was sending ~830,000 SYNACK per second.

This is ~100 times more what we could achieve before this patch.

Later, we will get rid of the listener hash and use ehash instead.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-03-20 12:40:25 -04:00
Fan Du
05cbc0db03 ipv4: Create probe timer for tcp PMTU as per RFC4821
As per RFC4821 7.3.  Selecting Probe Size, a probe timer should
be armed once probing has converged. Once this timer expired,
probing again to take advantage of any path PMTU change. The
recommended probing interval is 10 minutes per RFC1981. Probing
interval could be sysctled by sysctl_tcp_probe_interval.

Eric Dumazet suggested to implement pseudo timer based on 32bits
jiffies tcp_time_stamp instead of using classic timer for such
rare event.

Signed-off-by: Fan Du <fan.du@intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-03-06 14:57:42 -05:00
Fan Du
b0f9ca53cb ipv4: Namespecify TCP PMTU mechanism
Packetization Layer Path MTU Discovery works separately beside
Path MTU Discovery at IP level, different net namespace has
various requirements on which one to chose, e.g., a virutalized
container instance would require TCP PMTU to probe an usable
effective mtu for underlying tunnel, while the host would
employ classical ICMP based PMTU to function.

Hence making TCP PMTU mechanism per net namespace to decouple
two functionality. Furthermore the probe base MSS should also
be configured separately for each namespace.

Signed-off-by: Fan Du <fan.du@intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-02-09 18:45:00 -08:00
Joe Perches
ba7a46f16d net: Convert LIMIT_NETDEBUG to net_dbg_ratelimited
Use the more common dynamic_debug capable net_dbg_ratelimited
and remove the LIMIT_NETDEBUG macro.

All messages are still ratelimited.

Some KERN_<LEVEL> uses are changed to KERN_DEBUG.

This may have some negative impact on messages that were
emitted at KERN_INFO that are not not enabled at all unless
DEBUG is defined or dynamic_debug is enabled.  Even so,
these messages are now _not_ emitted by default.

This also eliminates the use of the net_msg_warn sysctl
"/proc/sys/net/core/warnings".  For backward compatibility,
the sysctl is not removed, but it has no function.  The extern
declaration of net_msg_warn is removed from sock.h and made
static in net/core/sysctl_net_core.c

Miscellanea:

o Update the sysctl documentation
o Remove the embedded uses of pr_fmt
o Coalesce format fragments
o Realign arguments

Signed-off-by: Joe Perches <joe@perches.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2014-11-11 14:10:31 -05:00
Yuchung Cheng
b248230c34 tcp: abort orphan sockets stalling on zero window probes
Currently we have two different policies for orphan sockets
that repeatedly stall on zero window ACKs. If a socket gets
a zero window ACK when it is transmitting data, the RTO is
used to probe the window. The socket is aborted after roughly
tcp_orphan_retries() retries (as in tcp_write_timeout()).

But if the socket was idle when it received the zero window ACK,
and later wants to send more data, we use the probe timer to
probe the window. If the receiver always returns zero window ACKs,
icsk_probes keeps getting reset in tcp_ack() and the orphan socket
can stall forever until the system reaches the orphan limit (as
commented in tcp_probe_timer()). This opens up a simple attack
to create lots of hanging orphan sockets to burn the memory
and the CPU, as demonstrated in the recent netdev post "TCP
connection will hang in FIN_WAIT1 after closing if zero window is
advertised." http://www.spinics.net/lists/netdev/msg296539.html

This patch follows the design in RTO-based probe: we abort an orphan
socket stalling on zero window when the probe timer reaches both
the maximum backoff and the maximum RTO. For example, an 100ms RTT
connection will timeout after roughly 153 seconds (0.3 + 0.6 +
.... + 76.8) if the receiver keeps the window shut. If the orphan
socket passes this check, but the system already has too many orphans
(as in tcp_out_of_resources()), we still abort it but we'll also
send an RST packet as the connection may still be active.

In addition, we change TCP_USER_TIMEOUT to cover (life or dead)
sockets stalled on zero-window probes. This changes the semantics
of TCP_USER_TIMEOUT slightly because it previously only applies
when the socket has pending transmission.

Signed-off-by: Yuchung Cheng <ycheng@google.com>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: Neal Cardwell <ncardwell@google.com>
Reported-by: Andrey Dmitrov <andrey.dmitrov@oktetlabs.ru>
Signed-off-by: David S. Miller <davem@davemloft.net>
2014-10-01 16:27:52 -04:00
Eric Dumazet
fcdd1cf4dd tcp: avoid possible arithmetic overflows
icsk_rto is a 32bit field, and icsk_backoff can reach 15 by default,
or more if some sysctl (eg tcp_retries2) are changed.

Better use 64bit to perform icsk_rto << icsk_backoff operations

As Joe Perches suggested, add a helper for this.

Yuchung spotted the tcp_v4_err() case.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2014-09-22 16:27:10 -04:00
Eric Dumazet
7faee5c0d5 tcp: remove TCP_SKB_CB(skb)->when
After commit 740b0f1841 ("tcp: switch rtt estimations to usec resolution"),
we no longer need to maintain timestamps in two different fields.

TCP_SKB_CB(skb)->when can be removed, as same information sits in skb_mstamp.stamp_jiffies

Signed-off-by: Eric Dumazet <edumazet@google.com>
Acked-by: Yuchung Cheng <ycheng@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2014-09-05 17:49:33 -07:00
Neal Cardwell
5ae344c949 tcp: reduce spurious retransmits due to transient SACK reneging
This commit reduces spurious retransmits due to apparent SACK reneging
by only reacting to SACK reneging that persists for a short delay.

When a sequence space hole at snd_una is filled, some TCP receivers
send a series of ACKs as they apparently scan their out-of-order queue
and cumulatively ACK all the packets that have now been consecutiveyly
received. This is essentially misbehavior B in "Misbehaviors in TCP
SACK generation" ACM SIGCOMM Computer Communication Review, April
2011, so we suspect that this is from several common OSes (Windows
2000, Windows Server 2003, Windows XP). However, this issue has also
been seen in other cases, e.g. the netdev thread "TCP being hoodwinked
into spurious retransmissions by lack of timestamps?" from March 2014,
where the receiver was thought to be a BSD box.

Since snd_una would temporarily be adjacent to a previously SACKed
range in these scenarios, this receiver behavior triggered the Linux
SACK reneging code path in the sender. This led the sender to clear
the SACK scoreboard, enter CA_Loss, and spuriously retransmit
(potentially) every packet from the entire write queue at line rate
just a few milliseconds before the ACK for each packet arrives at the
sender.

To avoid such situations, now when a sender sees apparent reneging it
does not yet retransmit, but rather adjusts the RTO timer to give the
receiver a little time (max(RTT/2, 10ms)) to send us some more ACKs
that will restore sanity to the SACK scoreboard. If the reneging
persists until this RTO then, as before, we clear the SACK scoreboard
and enter CA_Loss.

A 10ms delay tolerates a receiver sending such a stream of ACKs at
56Kbit/sec. And to allow for receivers with slower or more congested
paths, we wait for at least RTT/2.

We validated the resulting max(RTT/2, 10ms) delay formula with a mix
of North American and South American Google web server traffic, and
found that for ACKs displaying transient reneging:

 (1) 90% of inter-ACK delays were less than 10ms
 (2) 99% of inter-ACK delays were less than RTT/2

In tests on Google web servers this commit reduced reneging events by
75%-90% (as measured by the TcpExtTCPSACKReneging counter), without
any measurable impact on latency for user HTTP and SPDY requests.

Signed-off-by: Neal Cardwell <ncardwell@google.com>
Signed-off-by: Yuchung Cheng <ycheng@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2014-08-05 16:29:33 -07:00
Yuchung Cheng
f19c29e3e3 tcp: snmp stats for Fast Open, SYN rtx, and data pkts
Add the following snmp stats:

TCPFastOpenActiveFail: Fast Open attempts (SYN/data) failed beacuse
the remote does not accept it or the attempts timed out.

TCPSynRetrans: number of SYN and SYN/ACK retransmits to break down
retransmissions into SYN, fast-retransmits, timeout retransmits, etc.

TCPOrigDataSent: number of outgoing packets with original data (excluding
retransmission but including data-in-SYN). This counter is different from
TcpOutSegs because TcpOutSegs also tracks pure ACKs. TCPOrigDataSent is
more useful to track the TCP retransmission rate.

Change TCPFastOpenActive to track only successful Fast Opens to be symmetric to
TCPFastOpenPassive.

Signed-off-by: Yuchung Cheng <ycheng@google.com>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: Nandita Dukkipati <nanditad@google.com>
Signed-off-by: Lawrence Brakmo <brakmo@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2014-03-03 15:58:03 -05:00
Yuchung Cheng
c968601d17 tcp: temporarily disable Fast Open on SYN timeout
Fast Open currently has a fall back feature to address SYN-data being
dropped but it requires the middle-box to pass on regular SYN retry
after SYN-data. This is implemented in commit aab487435 ("net-tcp:
Fast Open client - detecting SYN-data drops")

However some NAT boxes will drop all subsequent packets after first
SYN-data and blackholes the entire connections.  An example is in
commit 356d7d8 "netfilter: nf_conntrack: fix tcp_in_window for Fast
Open".

The sender should note such incidents and fall back to use the regular
TCP handshake on subsequent attempts temporarily as well: after the
second SYN timeouts the original Fast Open SYN is most likely lost.
When such an event recurs Fast Open is disabled based on the number of
recurrences exponentially.

Signed-off-by: Yuchung Cheng <ycheng@google.com>
Signed-off-by: Neal Cardwell <ncardwell@google.com>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2013-10-29 22:50:41 -04:00
Eric Dumazet
efe4208f47 ipv6: make lookups simpler and faster
TCP listener refactoring, part 4 :

To speed up inet lookups, we moved IPv4 addresses from inet to struct
sock_common

Now is time to do the same for IPv6, because it permits us to have fast
lookups for all kind of sockets, including upcoming SYN_RECV.

Getting IPv6 addresses in TCP lookups currently requires two extra cache
lines, plus a dereference (and memory stall).

inet6_sk(sk) does the dereference of inet_sk(__sk)->pinet6

This patch is way bigger than its IPv4 counter part, because for IPv4,
we could add aliases (inet_daddr, inet_rcv_saddr), while on IPv6,
it's not doable easily.

inet6_sk(sk)->daddr becomes sk->sk_v6_daddr
inet6_sk(sk)->rcv_saddr becomes sk->sk_v6_rcv_saddr

And timewait socket also have tw->tw_v6_daddr & tw->tw_v6_rcv_saddr
at the same offset.

We get rid of INET6_TW_MATCH() as INET6_MATCH() is now the generic
macro.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2013-10-09 00:01:25 -04:00
Yuchung Cheng
9b44190dc1 tcp: refactor F-RTO
The patch series refactor the F-RTO feature (RFC4138/5682).

This is to simplify the loss recovery processing. Existing F-RTO
was developed during the experimental stage (RFC4138) and has
many experimental features.  It takes a separate code path from
the traditional timeout processing by overloading CA_Disorder
instead of using CA_Loss state. This complicates CA_Disorder state
handling because it's also used for handling dubious ACKs and undos.
While the algorithm in the RFC does not change the congestion control,
the implementation intercepts congestion control in various places
(e.g., frto_cwnd in tcp_ack()).

The new code implements newer F-RTO RFC5682 using CA_Loss processing
path.  F-RTO becomes a small extension in the timeout processing
and interfaces with congestion control and Eifel undo modules.
It lets congestion control (module) determines how many to send
independently.  F-RTO only chooses what to send in order to detect
spurious retranmission. If timeout is found spurious it invokes
existing Eifel undo algorithms like DSACK or TCP timestamp based
detection.

The first patch removes all F-RTO code except the sysctl_tcp_frto is
left for the new implementation.  Since CA_EVENT_FRTO is removed, TCP
westwood now computes ssthresh on regular timeout CA_EVENT_LOSS event.

Signed-off-by: Yuchung Cheng <ycheng@google.com>
Acked-by: Neal Cardwell <ncardwell@google.com>
Acked-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2013-03-21 11:47:50 -04:00
Nandita Dukkipati
9b717a8d24 tcp: TLP loss detection.
This is the second of the TLP patch series; it augments the basic TLP
algorithm with a loss detection scheme.

This patch implements a mechanism for loss detection when a Tail
loss probe retransmission plugs a hole thereby masking packet loss
from the sender. The loss detection algorithm relies on counting
TLP dupacks as outlined in Sec. 3 of:
http://tools.ietf.org/html/draft-dukkipati-tcpm-tcp-loss-probe-01

The basic idea is: Sender keeps track of TLP "episode" upon
retransmission of a TLP packet. An episode ends when the sender receives
an ACK above the SND.NXT (tracked by tlp_high_seq) at the time of the
episode. We want to make sure that before the episode ends the sender
receives a "TLP dupack", indicating that the TLP retransmission was
unnecessary, so there was no loss/hole that needed plugging. If the
sender gets no TLP dupack before the end of the episode, then it reduces
ssthresh and the congestion window, because the TLP packet arriving at
the receiver probably plugged a hole.

Signed-off-by: Nandita Dukkipati <nanditad@google.com>
Acked-by: Neal Cardwell <ncardwell@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2013-03-12 08:30:34 -04:00
Nandita Dukkipati
6ba8a3b19e tcp: Tail loss probe (TLP)
This patch series implement the Tail loss probe (TLP) algorithm described
in http://tools.ietf.org/html/draft-dukkipati-tcpm-tcp-loss-probe-01. The
first patch implements the basic algorithm.

TLP's goal is to reduce tail latency of short transactions. It achieves
this by converting retransmission timeouts (RTOs) occuring due
to tail losses (losses at end of transactions) into fast recovery.
TLP transmits one packet in two round-trips when a connection is in
Open state and isn't receiving any ACKs. The transmitted packet, aka
loss probe, can be either new or a retransmission. When there is tail
loss, the ACK from a loss probe triggers FACK/early-retransmit based
fast recovery, thus avoiding a costly RTO. In the absence of loss,
there is no change in the connection state.

PTO stands for probe timeout. It is a timer event indicating
that an ACK is overdue and triggers a loss probe packet. The PTO value
is set to max(2*SRTT, 10ms) and is adjusted to account for delayed
ACK timer when there is only one oustanding packet.

TLP Algorithm

On transmission of new data in Open state:
  -> packets_out > 1: schedule PTO in max(2*SRTT, 10ms).
  -> packets_out == 1: schedule PTO in max(2*RTT, 1.5*RTT + 200ms)
  -> PTO = min(PTO, RTO)

Conditions for scheduling PTO:
  -> Connection is in Open state.
  -> Connection is either cwnd limited or no new data to send.
  -> Number of probes per tail loss episode is limited to one.
  -> Connection is SACK enabled.

When PTO fires:
  new_segment_exists:
    -> transmit new segment.
    -> packets_out++. cwnd remains same.

  no_new_packet:
    -> retransmit the last segment.
       Its ACK triggers FACK or early retransmit based recovery.

ACK path:
  -> rearm RTO at start of ACK processing.
  -> reschedule PTO if need be.

In addition, the patch includes a small variation to the Early Retransmit
(ER) algorithm, such that ER and TLP together can in principle recover any
N-degree of tail loss through fast recovery. TLP is controlled by the same
sysctl as ER, tcp_early_retrans sysctl.
tcp_early_retrans==0; disables TLP and ER.
		 ==1; enables RFC5827 ER.
		 ==2; delayed ER.
		 ==3; TLP and delayed ER. [DEFAULT]
		 ==4; TLP only.

The TLP patch series have been extensively tested on Google Web servers.
It is most effective for short Web trasactions, where it reduced RTOs by 15%
and improved HTTP response time (average by 6%, 99th percentile by 10%).
The transmitted probes account for <0.5% of the overall transmissions.

Signed-off-by: Nandita Dukkipati <nanditad@google.com>
Acked-by: Neal Cardwell <ncardwell@google.com>
Acked-by: Yuchung Cheng <ycheng@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2013-03-12 08:30:34 -04:00
David S. Miller
d4185bbf62 Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net
Conflicts:
	drivers/net/ethernet/broadcom/bnx2x/bnx2x_main.c

Minor conflict between the BCM_CNIC define removal in net-next
and a bug fix added to net.  Based upon a conflict resolution
patch posted by Stephen Rothwell.

Signed-off-by: David S. Miller <davem@davemloft.net>
2012-11-10 18:32:51 -05:00
Eric Dumazet
e6c022a4fa tcp: better retrans tracking for defer-accept
For passive TCP connections using TCP_DEFER_ACCEPT facility,
we incorrectly increment req->retrans each time timeout triggers
while no SYNACK is sent.

SYNACK are not sent for TCP_DEFER_ACCEPT that were established (for
which we received the ACK from client). Only the last SYNACK is sent
so that we can receive again an ACK from client, to move the req into
accept queue. We plan to change this later to avoid the useless
retransmit (and potential problem as this SYNACK could be lost)

TCP_INFO later gives wrong information to user, claiming imaginary
retransmits.

Decouple req->retrans field into two independent fields :

num_retrans : number of retransmit
num_timeout : number of timeouts

num_timeout is the counter that is incremented at each timeout,
regardless of actual SYNACK being sent or not, and used to
compute the exponential timeout.

Introduce inet_rtx_syn_ack() helper to increment num_retrans
only if ->rtx_syn_ack() succeeded.

Use inet_rtx_syn_ack() from tcp_check_req() to increment num_retrans
when we re-send a SYNACK in answer to a (retransmitted) SYN.
Prior to this patch, we were not counting these retransmits.

Change tcp_v[46]_rtx_synack() to increment TCP_MIB_RETRANSSEGS
only if a synack packet was successfully queued.

Reported-by: Yuchung Cheng <ycheng@google.com>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Julian Anastasov <ja@ssi.bg>
Cc: Vijay Subramanian <subramanian.vijay@gmail.com>
Cc: Elliott Hughes <enh@google.com>
Cc: Neal Cardwell <ncardwell@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2012-11-03 14:45:00 -04:00
Jerry Chu
37561f68bd tcp: Reject invalid ack_seq to Fast Open sockets
A packet with an invalid ack_seq may cause a TCP Fast Open socket to switch
to the unexpected TCP_CLOSING state, triggering a BUG_ON kernel panic.

When a FIN packet with an invalid ack_seq# arrives at a socket in
the TCP_FIN_WAIT1 state, rather than discarding the packet, the current
code will accept the FIN, causing state transition to TCP_CLOSING.

This may be a small deviation from RFC793, which seems to say that the
packet should be dropped. Unfortunately I did not expect this case for
Fast Open hence it will trigger a BUG_ON panic.

It turns out there is really nothing bad about a TFO socket going into
TCP_CLOSING state so I could just remove the BUG_ON statements. But after
some thought I think it's better to treat this case like TCP_SYN_RECV
and return a RST to the confused peer who caused the unacceptable ack_seq
to be generated in the first place.

Signed-off-by: H.K. Jerry Chu <hkchu@google.com>
Cc: Neal Cardwell <ncardwell@google.com>
Cc: Yuchung Cheng <ycheng@google.com>
Acked-by: Yuchung Cheng <ycheng@google.com>
Acked-by: Eric Dumazet <edumazet@google.com>
Acked-by: Neal Cardwell <ncardwell@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2012-10-23 02:42:56 -04:00
Jerry Chu
8336886f78 tcp: TCP Fast Open Server - support TFO listeners
This patch builds on top of the previous patch to add the support
for TFO listeners. This includes -

1. allocating, properly initializing, and managing the per listener
fastopen_queue structure when TFO is enabled

2. changes to the inet_csk_accept code to support TFO. E.g., the
request_sock can no longer be freed upon accept(), not until 3WHS
finishes

3. allowing a TCP_SYN_RECV socket to properly poll() and sendmsg()
if it's a TFO socket

4. properly closing a TFO listener, and a TFO socket before 3WHS
finishes

5. supporting TCP_FASTOPEN socket option

6. modifying tcp_check_req() to use to check a TFO socket as well
as request_sock

7. supporting TCP's TFO cookie option

8. adding a new SYN-ACK retransmit handler to use the timer directly
off the TFO socket rather than the listener socket. Note that TFO
server side will not retransmit anything other than SYN-ACK until
the 3WHS is completed.

The patch also contains an important function
"reqsk_fastopen_remove()" to manage the somewhat complex relation
between a listener, its request_sock, and the corresponding child
socket. See the comment above the function for the detail.

Signed-off-by: H.K. Jerry Chu <hkchu@google.com>
Cc: Yuchung Cheng <ycheng@google.com>
Cc: Neal Cardwell <ncardwell@google.com>
Cc: Eric Dumazet <edumazet@google.com>
Cc: Tom Herbert <therbert@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2012-08-31 20:02:19 -04:00
Eric Dumazet
144d56e910 tcp: fix possible socket refcount problem
Commit 6f458dfb40 (tcp: improve latencies of timer triggered events)
added bug leading to following trace :

[ 2866.131281] IPv4: Attempt to release TCP socket in state 1 ffff880019ec0000
[ 2866.131726]
[ 2866.132188] =========================
[ 2866.132281] [ BUG: held lock freed! ]
[ 2866.132281] 3.6.0-rc1+ #622 Not tainted
[ 2866.132281] -------------------------
[ 2866.132281] kworker/0:1/652 is freeing memory ffff880019ec0000-ffff880019ec0a1f, with a lock still held there!
[ 2866.132281]  (sk_lock-AF_INET-RPC){+.+...}, at: [<ffffffff81903619>] tcp_sendmsg+0x29/0xcc6
[ 2866.132281] 4 locks held by kworker/0:1/652:
[ 2866.132281]  #0:  (rpciod){.+.+.+}, at: [<ffffffff81083567>] process_one_work+0x1de/0x47f
[ 2866.132281]  #1:  ((&task->u.tk_work)){+.+.+.}, at: [<ffffffff81083567>] process_one_work+0x1de/0x47f
[ 2866.132281]  #2:  (sk_lock-AF_INET-RPC){+.+...}, at: [<ffffffff81903619>] tcp_sendmsg+0x29/0xcc6
[ 2866.132281]  #3:  (&icsk->icsk_retransmit_timer){+.-...}, at: [<ffffffff81078017>] run_timer_softirq+0x1ad/0x35f
[ 2866.132281]
[ 2866.132281] stack backtrace:
[ 2866.132281] Pid: 652, comm: kworker/0:1 Not tainted 3.6.0-rc1+ #622
[ 2866.132281] Call Trace:
[ 2866.132281]  <IRQ>  [<ffffffff810bc527>] debug_check_no_locks_freed+0x112/0x159
[ 2866.132281]  [<ffffffff818a0839>] ? __sk_free+0xfd/0x114
[ 2866.132281]  [<ffffffff811549fa>] kmem_cache_free+0x6b/0x13a
[ 2866.132281]  [<ffffffff818a0839>] __sk_free+0xfd/0x114
[ 2866.132281]  [<ffffffff818a08c0>] sk_free+0x1c/0x1e
[ 2866.132281]  [<ffffffff81911e1c>] tcp_write_timer+0x51/0x56
[ 2866.132281]  [<ffffffff81078082>] run_timer_softirq+0x218/0x35f
[ 2866.132281]  [<ffffffff81078017>] ? run_timer_softirq+0x1ad/0x35f
[ 2866.132281]  [<ffffffff810f5831>] ? rb_commit+0x58/0x85
[ 2866.132281]  [<ffffffff81911dcb>] ? tcp_write_timer_handler+0x148/0x148
[ 2866.132281]  [<ffffffff81070bd6>] __do_softirq+0xcb/0x1f9
[ 2866.132281]  [<ffffffff81a0a00c>] ? _raw_spin_unlock+0x29/0x2e
[ 2866.132281]  [<ffffffff81a1227c>] call_softirq+0x1c/0x30
[ 2866.132281]  [<ffffffff81039f38>] do_softirq+0x4a/0xa6
[ 2866.132281]  [<ffffffff81070f2b>] irq_exit+0x51/0xad
[ 2866.132281]  [<ffffffff81a129cd>] do_IRQ+0x9d/0xb4
[ 2866.132281]  [<ffffffff81a0a3ef>] common_interrupt+0x6f/0x6f
[ 2866.132281]  <EOI>  [<ffffffff8109d006>] ? sched_clock_cpu+0x58/0xd1
[ 2866.132281]  [<ffffffff81a0a172>] ? _raw_spin_unlock_irqrestore+0x4c/0x56
[ 2866.132281]  [<ffffffff81078692>] mod_timer+0x178/0x1a9
[ 2866.132281]  [<ffffffff818a00aa>] sk_reset_timer+0x19/0x26
[ 2866.132281]  [<ffffffff8190b2cc>] tcp_rearm_rto+0x99/0xa4
[ 2866.132281]  [<ffffffff8190dfba>] tcp_event_new_data_sent+0x6e/0x70
[ 2866.132281]  [<ffffffff8190f7ea>] tcp_write_xmit+0x7de/0x8e4
[ 2866.132281]  [<ffffffff818a565d>] ? __alloc_skb+0xa0/0x1a1
[ 2866.132281]  [<ffffffff8190f952>] __tcp_push_pending_frames+0x2e/0x8a
[ 2866.132281]  [<ffffffff81904122>] tcp_sendmsg+0xb32/0xcc6
[ 2866.132281]  [<ffffffff819229c2>] inet_sendmsg+0xaa/0xd5
[ 2866.132281]  [<ffffffff81922918>] ? inet_autobind+0x5f/0x5f
[ 2866.132281]  [<ffffffff810ee7f1>] ? trace_clock_local+0x9/0xb
[ 2866.132281]  [<ffffffff8189adab>] sock_sendmsg+0xa3/0xc4
[ 2866.132281]  [<ffffffff810f5de6>] ? rb_reserve_next_event+0x26f/0x2d5
[ 2866.132281]  [<ffffffff8103e6a9>] ? native_sched_clock+0x29/0x6f
[ 2866.132281]  [<ffffffff8103e6f8>] ? sched_clock+0x9/0xd
[ 2866.132281]  [<ffffffff810ee7f1>] ? trace_clock_local+0x9/0xb
[ 2866.132281]  [<ffffffff8189ae03>] kernel_sendmsg+0x37/0x43
[ 2866.132281]  [<ffffffff8199ce49>] xs_send_kvec+0x77/0x80
[ 2866.132281]  [<ffffffff8199cec1>] xs_sendpages+0x6f/0x1a0
[ 2866.132281]  [<ffffffff8107826d>] ? try_to_del_timer_sync+0x55/0x61
[ 2866.132281]  [<ffffffff8199d0d2>] xs_tcp_send_request+0x55/0xf1
[ 2866.132281]  [<ffffffff8199bb90>] xprt_transmit+0x89/0x1db
[ 2866.132281]  [<ffffffff81999bcd>] ? call_connect+0x3c/0x3c
[ 2866.132281]  [<ffffffff81999d92>] call_transmit+0x1c5/0x20e
[ 2866.132281]  [<ffffffff819a0d55>] __rpc_execute+0x6f/0x225
[ 2866.132281]  [<ffffffff81999bcd>] ? call_connect+0x3c/0x3c
[ 2866.132281]  [<ffffffff819a0f33>] rpc_async_schedule+0x28/0x34
[ 2866.132281]  [<ffffffff810835d6>] process_one_work+0x24d/0x47f
[ 2866.132281]  [<ffffffff81083567>] ? process_one_work+0x1de/0x47f
[ 2866.132281]  [<ffffffff819a0f0b>] ? __rpc_execute+0x225/0x225
[ 2866.132281]  [<ffffffff81083a6d>] worker_thread+0x236/0x317
[ 2866.132281]  [<ffffffff81083837>] ? process_scheduled_works+0x2f/0x2f
[ 2866.132281]  [<ffffffff8108b7b8>] kthread+0x9a/0xa2
[ 2866.132281]  [<ffffffff81a12184>] kernel_thread_helper+0x4/0x10
[ 2866.132281]  [<ffffffff81a0a4b0>] ? retint_restore_args+0x13/0x13
[ 2866.132281]  [<ffffffff8108b71e>] ? __init_kthread_worker+0x5a/0x5a
[ 2866.132281]  [<ffffffff81a12180>] ? gs_change+0x13/0x13
[ 2866.308506] IPv4: Attempt to release TCP socket in state 1 ffff880019ec0000
[ 2866.309689] =============================================================================
[ 2866.310254] BUG TCP (Not tainted): Object already free
[ 2866.310254] -----------------------------------------------------------------------------
[ 2866.310254]

The bug comes from the fact that timer set in sk_reset_timer() can run
before we actually do the sock_hold(). socket refcount reaches zero and
we free the socket too soon.

timer handler is not allowed to reduce socket refcnt if socket is owned
by the user, or we need to change sk_reset_timer() implementation.

We should take a reference on the socket in case TCP_DELACK_TIMER_DEFERRED
or TCP_DELACK_TIMER_DEFERRED bit are set in tsq_flags

Also fix a typo in tcp_delack_timer(), where TCP_WRITE_TIMER_DEFERRED
was used instead of TCP_DELACK_TIMER_DEFERRED.

For consistency, use same socket refcount change for TCP_MTU_REDUCED_DEFERRED,
even if not fired from a timer.

Reported-by: Fengguang Wu <fengguang.wu@intel.com>
Tested-by: Fengguang Wu <fengguang.wu@intel.com>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2012-08-21 14:42:23 -07:00
Eric Dumazet
6f458dfb40 tcp: improve latencies of timer triggered events
Modern TCP stack highly depends on tcp_write_timer() having a small
latency, but current implementation doesn't exactly meet the
expectations.

When a timer fires but finds the socket is owned by the user, it rearms
itself for an additional delay hoping next run will be more
successful.

tcp_write_timer() for example uses a 50ms delay for next try, and it
defeats many attempts to get predictable TCP behavior in term of
latencies.

Use the recently introduced tcp_release_cb(), so that the user owning
the socket will call various handlers right before socket release.

This will permit us to post a followup patch to address the
tcp_tso_should_defer() syndrome (some deferred packets have to wait
RTO timer to be transmitted, while cwnd should allow us to send them
sooner)

Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Tom Herbert <therbert@google.com>
Cc: Yuchung Cheng <ycheng@google.com>
Cc: Neal Cardwell <ncardwell@google.com>
Cc: Nandita Dukkipati <nanditad@google.com>
Cc: H.K. Jerry Chu <hkchu@google.com>
Cc: John Heffner <johnwheffner@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2012-07-20 10:59:41 -07:00
Yuchung Cheng
750ea2bafa tcp: early retransmit: delayed fast retransmit
Implementing the advanced early retransmit (sysctl_tcp_early_retrans==2).
Delays the fast retransmit by an interval of RTT/4. We borrow the
RTO timer to implement the delay. If we receive another ACK or send
a new packet, the timer is cancelled and restored to original RTO
value offset by time elapsed.  When the delayed-ER timer fires,
we enter fast recovery and perform fast retransmit.

Signed-off-by: Yuchung Cheng <ycheng@google.com>
Acked-by: Neal Cardwell <ncardwell@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2012-05-02 20:56:10 -04:00
Joe Perches
afd465030a net: ipv4: Standardize prefixes for message logging
Add #define pr_fmt(fmt) as appropriate.

Add "IPv4: ", "TCP: ", and "IPsec: " to appropriate files.
Standardize on "UDPLite: " for appropriate uses.
Some prefixes were previously "UDPLITE: " and "UDP-Lite: ".

Add KBUILD_MODNAME ": " to icmp and gre.
Remove embedded prefixes as appropriate.

Add missing "\n" to pr_info in gre.c.

Signed-off-by: Joe Perches <joe@perches.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2012-03-12 17:05:21 -07:00
Arun Sharma
efcdbf24fd net: Disambiguate kernel message
Some of our machines were reporting:

TCP: too many of orphaned sockets

even when the number of orphaned sockets was well below the
limit.

We print a different message depending on whether we're out
of TCP memory or there are too many orphaned sockets.

Also move the check out of line and cleanup the messages
that were printed.

Signed-off-by: Arun Sharma <asharma@fb.com>
Suggested-by: Mohan Srinivasan <mohan@fb.com>
Cc: netdev@vger.kernel.org
Cc: linux-kernel@vger.kernel.org
Cc: David Miller <davem@davemloft.net>
Cc: Glauber Costa <glommer@parallels.com>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: Joe Perches <joe@perches.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2012-02-01 14:41:50 -05:00
Rusty Russell
3db1cd5c05 net: fix assignment of 0/1 to bool variables.
DaveM said:
   Please, this kind of stuff rots forever and not using bool properly
   drives me crazy.

Joe Perches <joe@perches.com> gave me the spatch script:

	@@
	bool b;
	@@
	-b = 0
	+b = false
	@@
	bool b;
	@@
	-b = 1
	+b = true

I merely installed coccinelle, read the documentation and took credit.

Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
Signed-off-by: David S. Miller <davem@davemloft.net>
2011-12-19 22:27:29 -05:00
Glauber Costa
180d8cd942 foundations of per-cgroup memory pressure controlling.
This patch replaces all uses of struct sock fields' memory_pressure,
memory_allocated, sockets_allocated, and sysctl_mem to acessor
macros. Those macros can either receive a socket argument, or a mem_cgroup
argument, depending on the context they live in.

Since we're only doing a macro wrapping here, no performance impact at all is
expected in the case where we don't have cgroups disabled.

Signed-off-by: Glauber Costa <glommer@parallels.com>
Reviewed-by: Hiroyouki Kamezawa <kamezawa.hiroyu@jp.fujitsu.com>
CC: David S. Miller <davem@davemloft.net>
CC: Eric W. Biederman <ebiederm@xmission.com>
CC: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2011-12-12 19:04:10 -05:00
Eric Dumazet
dfd56b8b38 net: use IS_ENABLED(CONFIG_IPV6)
Instead of testing defined(CONFIG_IPV6) || defined(CONFIG_IPV6_MODULE)

Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2011-12-11 18:25:16 -05:00
Flavio Leitner
78d81d15b7 TCP: remove TCP_DEBUG
It was enabled by default and the messages guarded
by the define are useful.

Signed-off-by: Flavio Leitner <fbl@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2011-10-24 17:36:08 -04:00
Shan Wei
089c34827e tcp: Remove debug macro of TCP_CHECK_TIMER
Now, TCP_CHECK_TIMER is not used for debuging, it does nothing.
And, it has been there for several years, maybe 6 years.

Remove it to keep code clearer.

Signed-off-by: Shan Wei <shanwei@cn.fujitsu.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2011-02-20 11:10:14 -08:00
Ilpo Järvinen
c60ce4e265 tcp: use correct counters in CA_CWR state too
As CWR is stronger than CA_Disorder state, we can miscount
SACK/Reno failure into other timeouts. Not a bad problem as
it can happen only due to ECN, FRTO detecting spurious RTO
or xmit error which are the only callers of tcp_enter_cwr.
And even then losses and RTO must still follow thereafter
to actually end up into the relevant code paths.

Compile tested.

Signed-off-by: Ilpo Järvinen <ilpo.jarvinen@helsinki.fi>
Signed-off-by: David S. Miller <davem@davemloft.net>
2010-10-17 13:46:33 -07:00
David S. Miller
21a180cda0 Merge branch 'master' of master.kernel.org:/pub/scm/linux/kernel/git/davem/net-2.6
Conflicts:
	net/ipv4/Kconfig
	net/ipv4/tcp_timer.c
2010-10-04 11:56:38 -07:00
Damian Lukowski
4d22f7d372 net-2.6: SYN retransmits: Add new parameter to retransmits_timed_out()
Fixes kernel Bugzilla Bug 18952

This patch adds a syn_set parameter to the retransmits_timed_out()
routine and updates its callers. If not set, TCP_RTO_MIN is taken
as the calculation basis as before. If set, TCP_TIMEOUT_INIT is
used instead, so that sysctl_syn_retries represents the actual
amount of SYN retransmissions in case no SYNACKs are received when
establishing a new connection.

Signed-off-by: Damian Lukowski <damian@tvk.rwth-aachen.de>
Signed-off-by: David S. Miller <davem@davemloft.net>
2010-09-28 13:08:32 -07:00
David S. Miller
e548833df8 Merge branch 'master' of master.kernel.org:/pub/scm/linux/kernel/git/davem/net-2.6
Conflicts:
	net/mac80211/main.c
2010-09-09 22:27:33 -07:00
Jerry Chu
dca43c75e7 tcp: Add TCP_USER_TIMEOUT socket option.
This patch provides a "user timeout" support as described in RFC793. The
socket option is also needed for the the local half of RFC5482 "TCP User
Timeout Option".

TCP_USER_TIMEOUT is a TCP level socket option that takes an unsigned int,
when > 0, to specify the maximum amount of time in ms that transmitted
data may remain unacknowledged before TCP will forcefully close the
corresponding connection and return ETIMEDOUT to the application. If
0 is given, TCP will continue to use the system default.

Increasing the user timeouts allows a TCP connection to survive extended
periods without end-to-end connectivity. Decreasing the user timeouts
allows applications to "fail fast" if so desired. Otherwise it may take
upto 20 minutes with the current system defaults in a normal WAN
environment.

The socket option can be made during any state of a TCP connection, but
is only effective during the synchronized states of a connection
(ESTABLISHED, FIN-WAIT-1, FIN-WAIT-2, CLOSE-WAIT, CLOSING, or LAST-ACK).
Moreover, when used with the TCP keepalive (SO_KEEPALIVE) option,
TCP_USER_TIMEOUT will overtake keepalive to determine when to close a
connection due to keepalive failure.

The option does not change in anyway when TCP retransmits a packet, nor
when a keepalive probe will be sent.

This option, like many others, will be inherited by an acceptor from its
listener.

Signed-off-by: H.K. Jerry Chu <hkchu@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2010-08-30 13:23:33 -07:00
David S. Miller
ad1af0fedb tcp: Combat per-cpu skew in orphan tests.
As reported by Anton Blanchard when we use
percpu_counter_read_positive() to make our orphan socket limit checks,
the check can be off by up to num_cpus_online() * batch (which is 32
by default) which on a 128 cpu machine can be as large as the default
orphan limit itself.

Fix this by doing the full expensive sum check if the optimized check
triggers.

Reported-by: Anton Blanchard <anton@samba.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
Acked-by: Eric Dumazet <eric.dumazet@gmail.com>
2010-08-25 02:27:49 -07:00
Eric Dumazet
4bc2f18ba4 net/ipv4: EXPORT_SYMBOL cleanups
CodingStyle cleanups

EXPORT_SYMBOL should immediately follow the symbol declaration.

Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2010-07-12 12:57:54 -07:00