Commit graph

482 commits

Author SHA1 Message Date
Runmin Wang
750075feff Merge remote-tracking branch 'origin/tmp-917a9a9133a6' into lsk
* tmp-917a9:
  ARM/vdso: Mark the vDSO code read-only after init
  x86/vdso: Mark the vDSO code read-only after init
  lkdtm: Verify that '__ro_after_init' works correctly
  arch: Introduce post-init read-only memory
  x86/mm: Always enable CONFIG_DEBUG_RODATA and remove the Kconfig option
  mm/init: Add 'rodata=off' boot cmdline parameter to disable read-only kernel mappings
  asm-generic: Consolidate mark_rodata_ro()
  Linux 4.4.6
  ld-version: Fix awk regex compile failure
  target: Drop incorrect ABORT_TASK put for completed commands
  block: don't optimize for non-cloned bio in bio_get_last_bvec()
  MIPS: smp.c: Fix uninitialised temp_foreign_map
  MIPS: Fix build error when SMP is used without GIC
  ovl: fix getcwd() failure after unsuccessful rmdir
  ovl: copy new uid/gid into overlayfs runtime inode
  userfaultfd: don't block on the last VM updates at exit time
  powerpc/powernv: Fix OPAL_CONSOLE_FLUSH prototype and usages
  powerpc/powernv: Add a kmsg_dumper that flushes console output on panic
  powerpc: Fix dedotify for binutils >= 2.26
  Revert "drm/radeon/pm: adjust display configuration after powerstate"
  drm/radeon: Fix error handling in radeon_flip_work_func.
  drm/amdgpu: Fix error handling in amdgpu_flip_work_func.
  Revert "drm/radeon: call hpd_irq_event on resume"
  x86/mm: Fix slow_virt_to_phys() for X86_PAE again
  gpu: ipu-v3: Do not bail out on missing optional port nodes
  mac80211: Fix Public Action frame RX in AP mode
  mac80211: check PN correctly for GCMP-encrypted fragmented MPDUs
  mac80211: minstrel_ht: fix a logic error in RTS/CTS handling
  mac80211: minstrel_ht: set default tx aggregation timeout to 0
  mac80211: fix use of uninitialised values in RX aggregation
  mac80211: minstrel: Change expected throughput unit back to Kbps
  iwlwifi: mvm: inc pending frames counter also when txing non-sta
  can: gs_usb: fixed disconnect bug by removing erroneous use of kfree()
  cfg80211/wext: fix message ordering
  wext: fix message delay/ordering
  ovl: fix working on distributed fs as lower layer
  ovl: ignore lower entries when checking purity of non-directory entries
  ASoC: wm8958: Fix enum ctl accesses in a wrong type
  ASoC: wm8994: Fix enum ctl accesses in a wrong type
  ASoC: samsung: Use IRQ safe spin lock calls
  ASoC: dapm: Fix ctl value accesses in a wrong type
  ncpfs: fix a braino in OOM handling in ncp_fill_cache()
  jffs2: reduce the breakage on recovery from halfway failed rename()
  dmaengine: at_xdmac: fix residue computation
  tracing: Fix check for cpu online when event is disabled
  s390/dasd: fix diag 0x250 inline assembly
  s390/mm: four page table levels vs. fork
  KVM: MMU: fix reserved bit check for ept=0/CR0.WP=0/CR4.SMEP=1/EFER.NX=0
  KVM: MMU: fix ept=0/pte.u=1/pte.w=0/CR0.WP=0/CR4.SMEP=1/EFER.NX=0 combo
  KVM: PPC: Book3S HV: Sanitize special-purpose register values on guest exit
  KVM: s390: correct fprs on SIGP (STOP AND) STORE STATUS
  KVM: VMX: disable PEBS before a guest entry
  kvm: cap halt polling at exactly halt_poll_ns
  PCI: Allow a NULL "parent" pointer in pci_bus_assign_domain_nr()
  ARM: OMAP2+: hwmod: Introduce ti,no-idle dt property
  ARM: dts: dra7: do not gate cpsw clock due to errata i877
  ARM: mvebu: fix overlap of Crypto SRAM with PCIe memory window
  arm64: account for sparsemem section alignment when choosing vmemmap offset
  Linux 4.4.5
  drm/amdgpu: fix topaz/tonga gmc assignment in 4.4 stable
  modules: fix longstanding /proc/kallsyms vs module insertion race.
  drm/i915: refine qemu south bridge detection
  drm/i915: more virtual south bridge detection
  block: get the 1st and last bvec via helpers
  block: check virt boundary in bio_will_gap()
  drm/amdgpu: Use drm_calloc_large for VM page_tables array
  thermal: cpu_cooling: fix out of bounds access in time_in_idle
  i2c: brcmstb: allocate correct amount of memory for regmap
  ubi: Fix out of bounds write in volume update code
  cxl: Fix PSL timebase synchronization detection
  MIPS: traps: Fix SIGFPE information leak from `do_ov' and `do_trap_or_bp'
  MIPS: scache: Fix scache init with invalid line size.
  USB: serial: option: add support for Quectel UC20
  USB: serial: option: add support for Telit LE922 PID 0x1045
  USB: qcserial: add Sierra Wireless EM74xx device ID
  USB: qcserial: add Dell Wireless 5809e Gobi 4G HSPA+ (rev3)
  USB: cp210x: Add ID for Parrot NMEA GPS Flight Recorder
  usb: chipidea: otg: change workqueue ci_otg as freezable
  ALSA: timer: Fix broken compat timer user status ioctl
  ALSA: hdspm: Fix zero-division
  ALSA: hdsp: Fix wrong boolean ctl value accesses
  ALSA: hdspm: Fix wrong boolean ctl value accesses
  ALSA: seq: oss: Don't drain at closing a client
  ALSA: pcm: Fix ioctls for X32 ABI
  ALSA: timer: Fix ioctls for X32 ABI
  ALSA: rawmidi: Fix ioctls X32 ABI
  ALSA: hda - Fix mic issues on Acer Aspire E1-472
  ALSA: ctl: Fix ioctls for X32 ABI
  ALSA: usb-audio: Add a quirk for Plantronics DA45
  adv7604: fix tx 5v detect regression
  dmaengine: pxa_dma: fix cyclic transfers
  Fix directory hardlinks from deleted directories
  jffs2: Fix page lock / f->sem deadlock
  Revert "jffs2: Fix lock acquisition order bug in jffs2_write_begin"
  Btrfs: fix loading of orphan roots leading to BUG_ON
  pata-rb532-cf: get rid of the irq_to_gpio() call
  tracing: Do not have 'comm' filter override event 'comm' field
  ata: ahci: don't mark HotPlugCapable Ports as external/removable
  PM / sleep / x86: Fix crash on graph trace through x86 suspend
  arm64: vmemmap: use virtual projection of linear region
  Adding Intel Lewisburg device IDs for SATA
  writeback: flush inode cgroup wb switches instead of pinning super_block
  block: bio: introduce helpers to get the 1st and last bvec
  libata: Align ata_device's id on a cacheline
  libata: fix HDIO_GET_32BIT ioctl
  drm/amdgpu: return from atombios_dp_get_dpcd only when error
  drm/amdgpu/gfx8: specify which engine to wait before vm flush
  drm/amdgpu: apply gfx_v8 fixes to gfx_v7 as well
  drm/amdgpu/pm: update current crtc info after setting the powerstate
  drm/radeon/pm: update current crtc info after setting the powerstate
  drm/ast: Fix incorrect register check for DRAM width
  target: Fix WRITE_SAME/DISCARD conversion to linux 512b sectors
  iommu/vt-d: Use BUS_NOTIFY_REMOVED_DEVICE in hotplug path
  iommu/amd: Fix boot warning when device 00:00.0 is not iommu covered
  iommu/amd: Apply workaround for ATS write permission check
  arm/arm64: KVM: Fix ioctl error handling
  KVM: x86: fix root cause for missed hardware breakpoints
  vfio: fix ioctl error handling
  Fix cifs_uniqueid_to_ino_t() function for s390x
  CIFS: Fix SMB2+ interim response processing for read requests
  cifs: fix out-of-bounds access in lease parsing
  fbcon: set a default value to blink interval
  kvm: x86: Update tsc multiplier on change.
  mips/kvm: fix ioctl error handling
  parisc: Fix ptrace syscall number and return value modification
  PCI: keystone: Fix MSI code that retrieves struct pcie_port pointer
  block: Initialize max_dev_sectors to 0
  drm/amdgpu: mask out WC from BO on unsupported arches
  btrfs: async-thread: Fix a use-after-free error for trace
  btrfs: Fix no_space in write and rm loop
  Btrfs: fix deadlock running delayed iputs at transaction commit time
  drivers: sh: Restore legacy clock domain on SuperH platforms
  use ->d_seq to get coherency between ->d_inode and ->d_flags
  Linux 4.4.4
  iwlwifi: mvm: don't allow sched scans without matches to be started
  iwlwifi: update and fix 7265 series PCI IDs
  iwlwifi: pcie: properly configure the debug buffer size for 8000
  iwlwifi: dvm: fix WoWLAN
  security: let security modules use PTRACE_MODE_* with bitmasks
  IB/cma: Fix RDMA port validation for iWarp
  x86/irq: Plug vector cleanup race
  x86/irq: Call irq_force_move_complete with irq descriptor
  x86/irq: Remove outgoing CPU from vector cleanup mask
  x86/irq: Remove the cpumask allocation from send_cleanup_vector()
  x86/irq: Clear move_in_progress before sending cleanup IPI
  x86/irq: Remove offline cpus from vector cleanup
  x86/irq: Get rid of code duplication
  x86/irq: Copy vectormask instead of an AND operation
  x86/irq: Check vector allocation early
  x86/irq: Reorganize the search in assign_irq_vector
  x86/irq: Reorganize the return path in assign_irq_vector
  x86/irq: Do not use apic_chip_data.old_domain as temporary buffer
  x86/irq: Validate that irq descriptor is still active
  x86/irq: Fix a race in x86_vector_free_irqs()
  x86/irq: Call chip->irq_set_affinity in proper context
  x86/entry/compat: Add missing CLAC to entry_INT80_32
  x86/mpx: Fix off-by-one comparison with nr_registers
  hpfs: don't truncate the file when delete fails
  do_last(): ELOOP failure exit should be done after leaving RCU mode
  should_follow_link(): validate ->d_seq after having decided to follow
  xen/pcifront: Fix mysterious crashes when NUMA locality information was extracted.
  xen/pciback: Save the number of MSI-X entries to be copied later.
  xen/pciback: Check PF instead of VF for PCI_COMMAND_MEMORY
  xen/scsiback: correct frontend counting
  xen/arm: correctly handle DMA mapping of compound pages
  ARM: at91/dt: fix typo in sama5d2 pinmux descriptions
  ARM: OMAP2+: Fix onenand initialization to avoid filesystem corruption
  do_last(): don't let a bogus return value from ->open() et.al. to confuse us
  kernel/resource.c: fix muxed resource handling in __request_region()
  sunrpc/cache: fix off-by-one in qword_get()
  tracing: Fix showing function event in available_events
  powerpc/eeh: Fix partial hotplug criterion
  KVM: x86: MMU: fix ubsan index-out-of-range warning
  KVM: x86: fix conversion of addresses to linear in 32-bit protected mode
  KVM: x86: fix missed hardware breakpoints
  KVM: arm/arm64: vgic: Ensure bitmaps are long enough
  KVM: async_pf: do not warn on page allocation failures
  of/irq: Fix msi-map calculation for nonzero rid-base
  NFSv4: Fix a dentry leak on alias use
  nfs: fix nfs_size_to_loff_t
  block: fix use-after-free in dio_bio_complete
  bio: return EINTR if copying to user space got interrupted
  i2c: i801: Adding Intel Lewisburg support for iTCO
  phy: core: fix wrong err handle for phy_power_on
  writeback: keep superblock pinned during cgroup writeback association switches
  cgroup: make sure a parent css isn't offlined before its children
  cpuset: make mm migration asynchronous
  PCI/AER: Flush workqueue on device remove to avoid use-after-free
  ARCv2: SMP: Emulate IPI to self using software triggered interrupt
  ARCv2: STAR 9000950267: Handle return from intr to Delay Slot #2
  libata: fix sff host state machine locking while polling
  qla2xxx: Fix stale pointer access.
  spi: atmel: fix gpio chip-select in case of non-DT platform
  target: Fix race with SCF_SEND_DELAYED_TAS handling
  target: Fix remote-port TMR ABORT + se_cmd fabric stop
  target: Fix TAS handling for multi-session se_node_acls
  target: Fix LUN_RESET active TMR descriptor handling
  target: Fix LUN_RESET active I/O handling for ACK_KREF
  ALSA: hda - Fixing background noise on Dell Inspiron 3162
  ALSA: hda - Apply clock gate workaround to Skylake, too
  Revert "workqueue: make sure delayed work run in local cpu"
  workqueue: handle NUMA_NO_NODE for unbound pool_workqueue lookup
  mac80211: Requeue work after scan complete for all VIF types.
  rfkill: fix rfkill_fop_read wait_event usage
  tick/nohz: Set the correct expiry when switching to nohz/lowres mode
  perf stat: Do not clean event's private stats
  cdc-acm:exclude Samsung phone 04e8:685d
  Revert "Staging: panel: usleep_range is preferred over udelay"
  Staging: speakup: Fix getting port information
  sd: Optimal I/O size is in bytes, not sectors
  libceph: don't spam dmesg with stray reply warnings
  libceph: use the right footer size when skipping a message
  libceph: don't bail early from try_read() when skipping a message
  libceph: fix ceph_msg_revoke()
  seccomp: always propagate NO_NEW_PRIVS on tsync
  cpufreq: Fix NULL reference crash while accessing policy->governor_data
  cpufreq: pxa2xx: fix pxa_cpufreq_change_voltage prototype
  hwmon: (ads1015) Handle negative conversion values correctly
  hwmon: (gpio-fan) Remove un-necessary speed_index lookup for thermal hook
  hwmon: (dell-smm) Blacklist Dell Studio XPS 8000
  Thermal: do thermal zone update after a cooling device registered
  Thermal: handle thermal zone device properly during system sleep
  Thermal: initialize thermal zone device correctly
  IB/mlx5: Expose correct maximum number of CQE capacity
  IB/qib: Support creating qps with GFP_NOIO flag
  IB/qib: fix mcast detach when qp not attached
  IB/cm: Fix a recently introduced deadlock
  dmaengine: dw: disable BLOCK IRQs for non-cyclic xfer
  dmaengine: at_xdmac: fix resume for cyclic transfers
  dmaengine: dw: fix cyclic transfer callbacks
  dmaengine: dw: fix cyclic transfer setup
  nfit: fix multi-interface dimm handling, acpi6.1 compatibility
  ACPI / PCI / hotplug: unlock in error path in acpiphp_enable_slot()
  ACPI: Revert "ACPI / video: Add Dell Inspiron 5737 to the blacklist"
  ACPI / video: Add disable_backlight_sysfs_if quirk for the Toshiba Satellite R830
  ACPI / video: Add disable_backlight_sysfs_if quirk for the Toshiba Portege R700
  lib: sw842: select crc32
  uapi: update install list after nvme.h rename
  ideapad-laptop: Add Lenovo Yoga 700 to no_hw_rfkill dmi list
  ideapad-laptop: Add Lenovo ideapad Y700-17ISK to no_hw_rfkill dmi list
  toshiba_acpi: Fix blank screen at boot if transflective backlight is supported
  make sure that freeing shmem fast symlinks is RCU-delayed
  drm/radeon/pm: adjust display configuration after powerstate
  drm/radeon: Don't hang in radeon_flip_work_func on disabled crtc. (v2)
  drm: Fix treatment of drm_vblank_offdelay in drm_vblank_on() (v2)
  drm: Fix drm_vblank_pre/post_modeset regression from Linux 4.4
  drm: Prevent vblank counter bumps > 1 with active vblank clients. (v2)
  drm: No-Op redundant calls to drm_vblank_off() (v2)
  drm/radeon: use post-decrement in error handling
  drm/qxl: use kmalloc_array to alloc reloc_info in qxl_process_single_command
  drm/i915: fix error path in intel_setup_gmbus()
  drm/i915/dsi: don't pass arbitrary data to sideband
  drm/i915/dsi: defend gpio table against out of bounds access
  drm/i915/skl: Don't skip mst encoders in skl_ddi_pll_select()
  drm/i915: Don't reject primary plane windowing with color keying enabled on SKL+
  drm/i915/dp: fall back to 18 bpp when sink capability is unknown
  drm/i915: Make sure DC writes are coherent on flush.
  drm/i915: Init power domains early in driver load
  drm/i915: intel_hpd_init(): Fix suspend/resume reprobing
  drm/i915: Restore inhibiting the load of the default context
  drm: fix missing reference counting decrease
  drm/radeon: hold reference to fences in radeon_sa_bo_new
  drm/radeon: mask out WC from BO on unsupported arches
  drm: add helper to check for wc memory support
  drm/radeon: fix DP audio support for APU with DCE4.1 display engine
  drm/radeon: Add a common function for DFS handling
  drm/radeon: cleaned up VCO output settings for DP audio
  drm/radeon: properly byte swap vce firmware setup
  drm/radeon: clean up fujitsu quirks
  drm/radeon: Fix "slow" audio over DP on DCE8+
  drm/radeon: call hpd_irq_event on resume
  drm/radeon: Fix off-by-one errors in radeon_vm_bo_set_addr
  drm/dp/mst: deallocate payload on port destruction
  drm/dp/mst: Reverse order of MST enable and clearing VC payload table.
  drm/dp/mst: move GUID storage from mgr, port to only mst branch
  drm/dp/mst: Calculate MST PBN with 31.32 fixed point
  drm: Add drm_fixp_from_fraction and drm_fixp2int_ceil
  drm/dp/mst: fix in RAD element access
  drm/dp/mst: fix in MSTB RAD initialization
  drm/dp/mst: always send reply for UP request
  drm/dp/mst: process broadcast messages correctly
  drm/nouveau: platform: Fix deferred probe
  drm/nouveau/disp/dp: ensure sink is powered up before attempting link training
  drm/nouveau/display: Enable vblank irqs after display engine is on again.
  drm/nouveau/kms: take mode_config mutex in connector hotplug path
  drm/amdgpu/pm: adjust display configuration after powerstate
  drm/amdgpu: Don't hang in amdgpu_flip_work_func on disabled crtc.
  drm/amdgpu: use post-decrement in error handling
  drm/amdgpu: fix issue with overlapping userptrs
  drm/amdgpu: hold reference to fences in amdgpu_sa_bo_new (v2)
  drm/amdgpu: remove unnecessary forward declaration
  drm/amdgpu: fix s4 resume
  drm/amdgpu: remove exp hardware support from iceland
  drm/amdgpu: don't load MEC2 on topaz
  drm/amdgpu: drop topaz support from gmc8 module
  drm/amdgpu: pull topaz gmc bits into gmc_v7
  drm/amdgpu: The VI specific EXE bit should only apply to GMC v8.0 above
  drm/amdgpu: iceland use CI based MC IP
  drm/amdgpu: move gmc7 support out of CIK dependency
  drm/amdgpu: no need to load MC firmware on fiji
  drm/amdgpu: fix amdgpu_bo_pin_restricted VRAM placing v2
  drm/amdgpu: fix tonga smu resume
  drm/amdgpu: fix lost sync_to if scheduler is enabled.
  drm/amdgpu: call hpd_irq_event on resume
  drm/amdgpu: Fix off-by-one errors in amdgpu_vm_bo_map
  drm/vmwgfx: respect 'nomodeset'
  drm/vmwgfx: Fix a width / pitch mismatch on framebuffer updates
  drm/vmwgfx: Fix an incorrect lock check
  virtio_pci: fix use after free on release
  virtio_balloon: fix race between migration and ballooning
  virtio_balloon: fix race by fill and leak
  regulator: mt6311: MT6311_REGULATOR needs to select REGMAP_I2C
  regulator: axp20x: Fix GPIO LDO enable value for AXP22x
  clk: exynos: use irqsave version of spin_lock to avoid deadlock with irqs
  cxl: use correct operator when writing pcie config space values
  sparc64: fix incorrect sign extension in sys_sparc64_personality
  EDAC, mc_sysfs: Fix freeing bus' name
  EDAC: Robustify workqueues destruction
  MIPS: Fix buffer overflow in syscall_get_arguments()
  MIPS: Fix some missing CONFIG_CPU_MIPSR6 #ifdefs
  MIPS: hpet: Choose a safe value for the ETIME check
  MIPS: Loongson-3: Fix SMP_ASK_C0COUNT IPI handler
  Revert "MIPS: Fix PAGE_MASK definition"
  cputime: Prevent 32bit overflow in time[val|spec]_to_cputime()
  time: Avoid signed overflow in timekeeping_get_ns()
  Bluetooth: 6lowpan: Fix handling of uncompressed IPv6 packets
  Bluetooth: 6lowpan: Fix kernel NULL pointer dereferences
  Bluetooth: Fix incorrect removing of IRKs
  Bluetooth: Add support of Toshiba Broadcom based devices
  Bluetooth: Use continuous scanning when creating LE connections
  Drivers: hv: vmbus: Fix a Host signaling bug
  tools: hv: vss: fix the write()'s argument: error -> vss_msg
  mmc: sdhci: Allow override of get_cd() called from sdhci_request()
  mmc: sdhci: Allow override of mmc host operations
  mmc: sdhci-pci: Fix card detect race for Intel BXT/APL
  mmc: pxamci: fix again read-only gpio detection polarity
  mmc: sdhci-acpi: Fix card detect race for Intel BXT/APL
  mmc: mmci: fix an ages old detection error
  mmc: core: Enable tuning according to the actual timing
  mmc: sdhci: Fix sdhci_runtime_pm_bus_on/off()
  mmc: mmc: Fix incorrect use of driver strength switching HS200 and HS400
  mmc: sdio: Fix invalid vdd in voltage switch power cycle
  mmc: sdhci: Fix DMA descriptor with zero data length
  mmc: sdhci-pci: Do not default to 33 Ohm driver strength for Intel SPT
  mmc: usdhi6rol0: handle NULL data in timeout
  clockevents/tcb_clksrc: Prevent disabling an already disabled clock
  posix-clock: Fix return code on the poll method's error path
  irqchip/gic-v3-its: Fix double ICC_EOIR write for LPI in EOImode==1
  irqchip/atmel-aic: Fix wrong bit operation for IRQ priority
  irqchip/mxs: Add missing set_handle_irq()
  irqchip/omap-intc: Add support for spurious irq handling
  coresight: checking for NULL string in coresight_name_match()
  dm: fix dm_rq_target_io leak on faults with .request_fn DM w/ blk-mq paths
  dm snapshot: fix hung bios when copy error occurs
  dm space map metadata: remove unused variable in brb_pop()
  tda1004x: only update the frontend properties if locked
  vb2: fix a regression in poll() behavior for output,streams
  gspca: ov534/topro: prevent a division by 0
  si2157: return -EINVAL if firmware blob is too big
  media: dvb-core: Don't force CAN_INVERSION_AUTO in oneshot mode
  rc: sunxi-cir: Initialize the spinlock properly
  namei: ->d_inode of a pinned dentry is stable only for positives
  mei: validate request value in client notify request ioctl
  mei: fix fasync return value on error
  rtlwifi: rtl8723be: Fix module parameter initialization
  rtlwifi: rtl8188ee: Fix module parameter initialization
  rtlwifi: rtl8192se: Fix module parameter initialization
  rtlwifi: rtl8723ae: Fix initialization of module parameters
  rtlwifi: rtl8192de: Fix incorrect module parameter descriptions
  rtlwifi: rtl8192ce: Fix handling of module parameters
  rtlwifi: rtl8192cu: Add missing parameter setup
  rtlwifi: rtl_pci: Fix kernel panic
  locks: fix unlock when fcntl_setlk races with a close
  um: link with -lpthread
  uml: fix hostfs mknod()
  uml: flush stdout before forking
  s390/fpu: signals vs. floating point control register
  s390/compat: correct restore of high gprs on signal return
  s390/dasd: fix performance drop
  s390/dasd: fix refcount for PAV reassignment
  s390/dasd: prevent incorrect length error under z/VM after PAV changes
  s390: fix normalization bug in exception table sorting
  btrfs: initialize the seq counter in struct btrfs_device
  Btrfs: Initialize btrfs_root->highest_objectid when loading tree root and subvolume roots
  Btrfs: fix transaction handle leak on failure to create hard link
  Btrfs: fix number of transaction units required to create symlink
  Btrfs: send, don't BUG_ON() when an empty symlink is found
  btrfs: statfs: report zero available if metadata are exhausted
  Btrfs: igrab inode in writepage
  Btrfs: add missing brelse when superblock checksum fails
  KVM: s390: fix memory overwrites when vx is disabled
  s390/kvm: remove dependency on struct save_area definition
  clocksource/drivers/vt8500: Increase the minimum delta
  genirq: Validate action before dereferencing it in handle_irq_event_percpu()
  mm: numa: quickly fail allocations for NUMA balancing on full nodes
  mm: thp: fix SMP race condition between THP page fault and MADV_DONTNEED
  ocfs2: unlock inode if deleting inode from orphan fails
  drm/i915: shut up gen8+ SDE irq dmesg noise
  iw_cxgb3: Fix incorrectly returning error on success
  spi: omap2-mcspi: Prevent duplicate gpio_request
  drivers: android: correct the size of struct binder_uintptr_t for BC_DEAD_BINDER_DONE
  USB: option: add "4G LTE usb-modem U901"
  USB: option: add support for SIM7100E
  USB: cp210x: add IDs for GE B650V3 and B850V3 boards
  usb: dwc3: Fix assignment of EP transfer resources
  can: ems_usb: Fix possible tx overflow
  dm thin: fix race condition when destroying thin pool workqueue
  bcache: Change refill_dirty() to always scan entire disk if necessary
  bcache: prevent crash on changing writeback_running
  bcache: allows use of register in udev to avoid "device_busy" error.
  bcache: unregister reboot notifier if bcache fails to unregister device
  bcache: fix a leak in bch_cached_dev_run()
  bcache: clear BCACHE_DEV_UNLINK_DONE flag when attaching a backing device
  bcache: Add a cond_resched() call to gc
  bcache: fix a livelock when we cause a huge number of cache misses
  lib/ucs2_string: Correct ucs2 -> utf8 conversion
  efi: Add pstore variables to the deletion whitelist
  efi: Make efivarfs entries immutable by default
  efi: Make our variable validation list include the guid
  efi: Do variable name validation tests in utf8
  efi: Use ucs2_as_utf8 in efivarfs instead of open coding a bad version
  lib/ucs2_string: Add ucs2 -> utf8 helper functions
  ARM: 8457/1: psci-smp is built only for SMP
  drm/gma500: Use correct unref in the gem bo create function
  devm_memremap: Fix error value when memremap failed
  KVM: s390: fix guest fprs memory leak
  arm64: errata: Add -mpc-relative-literal-loads to build flags
  ARM: debug-ll: fix BCM63xx entry for multiplatform
  ext4: fix bh->b_state corruption
  sctp: Fix port hash table size computation
  unix_diag: fix incorrect sign extension in unix_lookup_by_ino
  tipc: unlock in error path
  rtnl: RTM_GETNETCONF: fix wrong return value
  IFF_NO_QUEUE: Fix for drivers not calling ether_setup()
  tcp/dccp: fix another race at listener dismantle
  route: check and remove route cache when we get route
  net_sched fix: reclassification needs to consider ether protocol changes
  pppoe: fix reference counting in PPPoE proxy
  l2tp: Fix error creating L2TP tunnels
  net/mlx4_en: Avoid changing dev->features directly in run-time
  net/mlx4_en: Choose time-stamping shift value according to HW frequency
  net/mlx4_en: Count HW buffer overrun only once
  qmi_wwan: add "4G LTE usb-modem U901"
  tcp: md5: release request socket instead of listener
  tipc: fix premature addition of node to lookup table
  af_unix: Guard against other == sk in unix_dgram_sendmsg
  af_unix: Don't set err in unix_stream_read_generic unless there was an error
  ipv4: fix memory leaks in ip_cmsg_send() callers
  bonding: Fix ARP monitor validation
  bpf: fix branch offset adjustment on backjumps after patching ctx expansion
  flow_dissector: Fix unaligned access in __skb_flow_dissector when used by eth_get_headlen
  net: Copy inner L3 and L4 headers as unaligned on GRE TEB
  sctp: translate network order to host order when users get a hmacid
  enic: increment devcmd2 result ring in case of timeout
  tg3: Fix for tg3 transmit queue 0 timed out when too many gso_segs
  net:Add sysctl_max_skb_frags
  tcp: do not drop syn_recv on all icmp reports
  unix: correctly track in-flight fds in sending process user_struct
  ipv6: fix a lockdep splat
  ipv6: addrconf: Fix recursive spin lock call
  ipv6/udp: use sticky pktinfo egress ifindex on connect()
  ipv6: enforce flowi6_oif usage in ip6_dst_lookup_tail()
  tcp: beware of alignments in tcp_get_info()
  switchdev: Require RTNL mutex to be held when sending FDB notifications
  inet: frag: Always orphan skbs inside ip_defrag()
  tipc: fix connection abort during subscription cancel
  net: dsa: fix mv88e6xxx switches
  sctp: allow setting SCTP_SACK_IMMEDIATELY by the application
  pptp: fix illegal memory access caused by multiple bind()s
  af_unix: fix struct pid memory leak
  tcp: fix NULL deref in tcp_v4_send_ack()
  lwt: fix rx checksum setting for lwt devices tunneling over ipv6
  tunnels: Allow IPv6 UDP checksums to be correctly controlled.
  net: dp83640: Fix tx timestamp overflow handling.
  gro: Make GRO aware of lightweight tunnels.
  af_iucv: Validate socket address length in iucv_sock_bind()

Conflicts:
	arch/arm64/Makefile
	arch/arm64/include/asm/cacheflush.h
	drivers/mmc/host/sdhci.c
	drivers/usb/dwc3/ep0.c
	drivers/usb/dwc3/gadget.c
	kernel/module.c
	sound/core/pcm_compat.c

CRs-Fixed: 1010239
Signed-off-by: Runmin Wang <runminw@codeaurora.org>
Change-Id: I41a28636fc9ad91f9d979b191784609476294cdf
2016-07-12 11:40:49 -07:00
Eric Dumazet
2679161c77 tcp: do not drop syn_recv on all icmp reports
[ Upstream commit 9cf7490360bf2c46a16b7525f899e4970c5fc144 ]

Petr Novopashenniy reported that ICMP redirects on SYN_RECV sockets
were leading to RST.

This is of course incorrect.

A specific list of ICMP messages should be able to drop a SYN_RECV.

For instance, a REDIRECT on SYN_RECV shall be ignored, as we do
not hold a dst per SYN_RECV pseudo request.

Bugzilla: https://bugzilla.kernel.org/show_bug.cgi?id=111751
Fixes: 079096f103 ("tcp/dccp: install syn_recv requests into ehash table")
Reported-by: Petr Novopashenniy <pety@rusnet.ru>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2016-03-03 15:07:05 -08:00
Lorenzo Colitti
79170d8d5d net: diag: Support destroying TCP sockets.
This implements SOCK_DESTROY for TCP sockets. It causes all
blocking calls on the socket to fail fast with ECONNABORTED and
causes a protocol close of the socket. It informs the other end
of the connection by sending a RST, i.e., initiating a TCP ABORT
as per RFC 793. ECONNABORTED was chosen for consistency with
FreeBSD.

[cherry-pick of net-next c1e64e298b8cad309091b95d8436a0255c84f54a]

Change-Id: I728a01ef03f2ccfb9016a3f3051ef00975980e49
Signed-off-by: Lorenzo Colitti <lorenzo@google.com>
Acked-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-02-25 09:01:21 +09:00
JP Abgrall
72fc8653dc tcp: add a sysctl to config the tcp_default_init_rwnd
The default initial rwnd is hardcoded to 10.

Now we allow it to be controlled via
  /proc/sys/net/ipv4/tcp_default_init_rwnd
which limits the values from 3 to 100

This is somewhat needed because ipv6 routes are
autoconfigured by the kernel.

See "An Argument for Increasing TCP's Initial Congestion Window"
in https://developers.google.com/speed/articles/tcp_initcwnd_paper.pdf

Change-Id: I386b2a9d62de0ebe05c1ebe1b4bd91b314af5c54
Signed-off-by: JP Abgrall <jpa@google.com>

Conflicts:
	net/ipv4/sysctl_net_ipv4.c
	net/ipv4/tcp_input.c
2016-02-16 13:51:43 -08:00
Robert Love
38f0ec724f net: socket ioctl to reset connections matching local address
Introduce a new socket ioctl, SIOCKILLADDR, that nukes all sockets
bound to the same local address. This is useful in situations with
dynamic IPs, to kill stuck connections.

Signed-off-by: Brian Swetland <swetland@google.com>

net: fix tcp_v4_nuke_addr

Signed-off-by: Dima Zavin <dima@android.com>

net: ipv4: Fix a spinlock recursion bug in tcp_v4_nuke.

We can't hold the lock while calling to tcp_done(), so we drop
it before calling. We then have to start at the top of the chain again.

Signed-off-by: Dima Zavin <dima@android.com>

net: ipv4: Fix race in tcp_v4_nuke_addr().

To fix a recursive deadlock in 2.6.29, we stopped holding the hash table lock
across tcp_done() calls. This fixed the deadlock, but introduced a race where
the socket could die or change state.

Fix: Before unlocking the hash table, we grab a reference to the socket. We
can then unlock the hash table without risk of the socket going away. We then
lock the socket, which is safe because it is pinned. We can then call
tcp_done() without recursive deadlock and without race. Upon return, we unlock
the socket and then unpin it, killing it.

Change-Id: Idcdae072b48238b01bdbc8823b60310f1976e045
Signed-off-by: Robert Love <rlove@google.com>
Acked-by: Dima Zavin <dima@android.com>

ipv4: disable bottom halves around call to tcp_done().

Signed-off-by: Robert Love <rlove@google.com>
Signed-off-by: Colin Cross <ccross@android.com>

ipv4: Move sk_error_report inside bh_lock_sock in tcp_v4_nuke_addr

When sk_error_report is called, it wakes up the user-space thread, which then
calls tcp_close.  When the tcp_close is interrupted by the tcp_v4_nuke_addr
ioctl thread running tcp_done, it leaks 392 bytes and triggers a WARN_ON.

This patch moves the call to sk_error_report inside the bh_lock_sock, which
matches the locking used in tcp_v4_err.

Signed-off-by: Colin Cross <ccross@android.com>
2016-02-16 13:51:15 -08:00
Eric Dumazet
5e0724d027 tcp/dccp: fix hashdance race for passive sessions
Multiple cpus can process duplicates of incoming ACK messages
matching a SYN_RECV request socket. This is a rare event under
normal operations, but definitely can happen.

Only one must win the race, otherwise corruption would occur.

To fix this without adding new atomic ops, we use logic in
inet_ehash_nolisten() to detect the request was present in the same
ehash bucket where we try to insert the new child.

If request socket was not found, we have to undo the child creation.

This actually removes a spin_lock()/spin_unlock() pair in
reqsk_queue_unlink() for the fast path.

Fixes: e994b2f0fb ("tcp: do not lock listener to process SYN packets")
Fixes: 079096f103 ("tcp/dccp: install syn_recv requests into ehash table")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-10-23 05:42:21 -07:00
Yuchung Cheng
4f41b1c58a tcp: use RACK to detect losses
This patch implements the second half of RACK that uses the the most
recent transmit time among all delivered packets to detect losses.

tcp_rack_mark_lost() is called upon receiving a dubious ACK.
It then checks if an not-yet-sacked packet was sent at least
"reo_wnd" prior to the sent time of the most recently delivered.
If so the packet is deemed lost.

The "reo_wnd" reordering window starts with 1msec for fast loss
detection and changes to min-RTT/4 when reordering is observed.
We found 1msec accommodates well on tiny degree of reordering
(<3 pkts) on faster links. We use min-RTT instead of SRTT because
reordering is more of a path property but SRTT can be inflated by
self-inflicated congestion. The factor of 4 is borrowed from the
delayed early retransmit and seems to work reasonably well.

Since RACK is still experimental, it is now used as a supplemental
loss detection on top of existing algorithms. It is only effective
after the fast recovery starts or after the timeout occurs. The
fast recovery is still triggered by FACK and/or dupack threshold
instead of RACK.

We introduce a new sysctl net.ipv4.tcp_recovery for future
experiments of loss recoveries. For now RACK can be disabled by
setting it to 0.

Signed-off-by: Yuchung Cheng <ycheng@google.com>
Signed-off-by: Neal Cardwell <ncardwell@google.com>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-10-21 07:00:53 -07:00
Yuchung Cheng
659a8ad56f tcp: track the packet timings in RACK
This patch is the first half of the RACK loss recovery.

RACK loss recovery uses the notion of time instead
of packet sequence (FACK) or counts (dupthresh). It's inspired by the
previous FACK heuristic in tcp_mark_lost_retrans(): when a limited
transmit (new data packet) is sacked, then current retransmitted
sequence below the newly sacked sequence must been lost,
since at least one round trip time has elapsed.

But it has several limitations:
1) can't detect tail drops since it depends on limited transmit
2) is disabled upon reordering (assumes no reordering)
3) only enabled in fast recovery ut not timeout recovery

RACK (Recently ACK) addresses these limitations with the notion
of time instead: a packet P1 is lost if a later packet P2 is s/acked,
as at least one round trip has passed.

Since RACK cares about the time sequence instead of the data sequence
of packets, it can detect tail drops when later retransmission is
s/acked while FACK or dupthresh can't. For reordering RACK uses a
dynamically adjusted reordering window ("reo_wnd") to reduce false
positives on ever (small) degree of reordering.

This patch implements tcp_advanced_rack() which tracks the
most recent transmission time among the packets that have been
delivered (ACKed or SACKed) in tp->rack.mstamp. This timestamp
is the key to determine which packet has been lost.

Consider an example that the sender sends six packets:
T1: P1 (lost)
T2: P2
T3: P3
T4: P4
T100: sack of P2. rack.mstamp = T2
T101: retransmit P1
T102: sack of P2,P3,P4. rack.mstamp = T4
T205: ACK of P4 since the hole is repaired. rack.mstamp = T101

We need to be careful about spurious retransmission because it may
falsely advance tp->rack.mstamp by an RTT or an RTO, causing RACK
to falsely mark all packets lost, just like a spurious timeout.

We identify spurious retransmission by the ACK's TS echo value.
If TS option is not applicable but the retransmission is acknowledged
less than min-RTT ago, it is likely to be spurious. We refrain from
using the transmission time of these spurious retransmissions.

The second half is implemented in the next patch that marks packet
lost using RACK timestamp.

Signed-off-by: Yuchung Cheng <ycheng@google.com>
Signed-off-by: Neal Cardwell <ncardwell@google.com>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-10-21 07:00:48 -07:00
Yuchung Cheng
f672258391 tcp: track min RTT using windowed min-filter
Kathleen Nichols' algorithm for tracking the minimum RTT of a
data stream over some measurement window. It uses constant space
and constant time per update. Yet it almost always delivers
the same minimum as an implementation that has to keep all
the data in the window. The measurement window is tunable via
sysctl.net.ipv4.tcp_min_rtt_wlen with a default value of 5 minutes.

The algorithm keeps track of the best, 2nd best & 3rd best min
values, maintaining an invariant that the measurement time of
the n'th best >= n-1'th best. It also makes sure that the three
values are widely separated in the time window since that bounds
the worse case error when that data is monotonically increasing
over the window.

Upon getting a new min, we can forget everything earlier because
it has no value - the new min is less than everything else in the
window by definition and it's the most recent. So we restart fresh
on every new min and overwrites the 2nd & 3rd choices. The same
property holds for the 2nd & 3rd best.

Therefore we have to maintain two invariants to maximize the
information in the samples, one on values (1st.v <= 2nd.v <=
3rd.v) and the other on times (now-win <=1st.t <= 2nd.t <= 3rd.t <=
now). These invariants determine the structure of the code

The RTT input to the windowed filter is the minimum RTT measured
from ACK or SACK, or as the last resort from TCP timestamps.

The accessor tcp_min_rtt() returns the minimum RTT seen in the
window. ~0U indicates it is not available. The minimum is 1usec
even if the true RTT is below that.

Signed-off-by: Yuchung Cheng <ycheng@google.com>
Signed-off-by: Neal Cardwell <ncardwell@google.com>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-10-21 07:00:43 -07:00
Eric Dumazet
dc6ef6be52 tcp: do not set queue_mapping on SYNACK
At the time of commit fff3269907 ("tcp: reflect SYN queue_mapping into
SYNACK packets") we had little ways to cope with SYN floods.

We no longer need to reflect incoming skb queue mappings, and instead
can pick a TX queue based on cpu cooking the SYNACK, with normal XPS
affinities.

Note that all SYNACK retransmits were picking TX queue 0, this no longer
is a win given that SYNACK rtx are now distributed on all cpus.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-10-18 22:26:02 -07:00
Eric Dumazet
ca6fb06518 tcp: attach SYNACK messages to request sockets instead of listener
If a listen backlog is very big (to avoid syncookies), then
the listener sk->sk_wmem_alloc is the main source of false
sharing, as we need to touch it twice per SYNACK re-transmit
and TX completion.

(One SYN packet takes listener lock once, but up to 6 SYNACK
are generated)

By attaching the skb to the request socket, we remove this
source of contention.

Tested:

 listen(fd, 10485760); // single listener (no SO_REUSEPORT)
 16 RX/TX queue NIC
 Sustain a SYNFLOOD attack of ~320,000 SYN per second,
 Sending ~1,400,000 SYNACK per second.
 Perf profiles now show listener spinlock being next bottleneck.

    20.29%  [kernel]  [k] queued_spin_lock_slowpath
    10.06%  [kernel]  [k] __inet_lookup_established
     5.12%  [kernel]  [k] reqsk_timer_handler
     3.22%  [kernel]  [k] get_next_timer_interrupt
     3.00%  [kernel]  [k] tcp_make_synack
     2.77%  [kernel]  [k] ipt_do_table
     2.70%  [kernel]  [k] run_timer_softirq
     2.50%  [kernel]  [k] ip_finish_output
     2.04%  [kernel]  [k] cascade

Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-10-03 04:32:43 -07:00
Eric Dumazet
079096f103 tcp/dccp: install syn_recv requests into ehash table
In this patch, we insert request sockets into TCP/DCCP
regular ehash table (where ESTABLISHED and TIMEWAIT sockets
are) instead of using the per listener hash table.

ACK packets find SYN_RECV pseudo sockets without having
to find and lock the listener.

In nominal conditions, this halves pressure on listener lock.

Note that this will allow for SO_REUSEPORT refinements,
so that we can select a listener using cpu/numa affinities instead
of the prior 'consistent hash', since only SYN packets will
apply this selection logic.

We will shrink listen_sock in the following patch to ease
code review.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Ying Cai <ycai@google.com>
Cc: Willem de Bruijn <willemb@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-10-03 04:32:41 -07:00
Eric Dumazet
aa3a0c8ce6 tcp: get_openreq[46]() changes
When request sockets are no longer in a per listener hash table
but on regular TCP ehash, we need to access listener uid
through req->rsk_listener

get_openreq6() also gets a const for its request socket argument.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-10-03 04:32:40 -07:00
Eric Dumazet
f964629e33 tcp: constify tcp_v{4|6}_route_req() sock argument
These functions do not change the listener socket.
Goal is to make sure tcp_conn_request() is not messing with
listener in a racy way.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-09-29 16:53:09 -07:00
Eric Dumazet
3f684b4b1f tcp: cookie_init_sequence() cleanups
Some common IPv4/IPv6 code can be factorized.
Also constify cookie_init_sequence() socket argument.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-09-29 16:53:09 -07:00
Eric Dumazet
0c27171e66 tcp/dccp: constify syn_recv_sock() method sock argument
We'll soon no longer hold listener socket lock, these
functions do not modify the socket in any way.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-09-29 16:53:09 -07:00
Eric Dumazet
c28c6f0459 tcp: constify tcp_create_openreq_child() socket argument
This method does not touch the listener socket.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-09-29 16:53:09 -07:00
Eric Dumazet
72ab4a86f7 tcp: remove tcp_rcv_state_process() tcp_hdr argument
Factorize code to get tcp header from skb. It makes no sense
to duplicate code in callers.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-09-29 16:53:07 -07:00
Eric Dumazet
bda07a64c0 tcp: remove unused len argument from tcp_rcv_state_process()
Once we realize tcp_rcv_synsent_state_process() does not use
its 'len' argument and we get rid of it, then it becomes clear
this argument is no longer used in tcp_rcv_state_process()

Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-09-29 16:53:07 -07:00
Eric Dumazet
7c85af8810 tcp: avoid reorders for TFO passive connections
We found that a TCP Fast Open passive connection was vulnerable
to reorders, as the exchange might look like

[1] C -> S S <FO ...> <request>
[2] S -> C S. ack request <options>
[3] S -> C . <answer>

packets [2] and [3] can be generated at almost the same time.

If C receives the 3rd packet before the 2nd, it will drop it as
the socket is in SYN_SENT state and expects a SYNACK.

S will have to retransmit the answer.

Current OOO avoidance in linux is defeated because SYNACK
packets are attached to the LISTEN socket, while DATA packets
are attached to the children. They might be sent by different cpus,
and different TX queues might be selected.

It turns out that for TFO, we created a child, which is a
full blown socket in TCP_SYN_RECV state, and we simply can attach
the SYNACK packet to this socket.

This means that at the time tcp_sendmsg() pushes DATA packet,
skb->ooo_okay will be set iff the SYNACK packet had been sent
and TX completed.

This removes the reorder source at the host level.

We also removed the export of tcp_try_fastopen(), as it is no
longer called from IPv6.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: Yuchung Cheng <ycheng@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-09-28 22:11:19 -07:00
Eric Dumazet
ea3bea3a1d tcp/dccp: constify rtx_synack() and friends
This is done to make sure we do not change listener socket
while sending SYNACK packets while socket lock is not held.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-09-25 13:00:39 -07:00
Eric Dumazet
0f935dbedc tcp: constify tcp_v{4|6}_send_synack() socket argument
This documents fact that listener lock might not be held
at the time SYNACK are sent.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-09-25 13:00:39 -07:00
Eric Dumazet
5d062de7f8 tcp: constify tcp_make_synack() socket argument
listener socket is not locked when tcp_make_synack() is called.

We better make sure no field is written.

There is one exception : Since SYNACK packets are attached to the listener
at this moment (or SYN_RECV child in case of Fast Open),
sock_wmalloc() needs to update sk->sk_wmem_alloc, but this is done using
atomic operations so this is safe.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-09-25 13:00:38 -07:00
Eric Dumazet
b83e3deb97 tcp: md5: constify tcp_md5_do_lookup() socket argument
When TCP new listener is done, these functions will be called
without socket lock being held. Make sure they don't change
anything.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-09-25 13:00:38 -07:00
Eric Dumazet
b1964b5fce tcp: constify tcp_openreq_init_rwin()
Soon, listener socket wont be locked when tcp_openreq_init_rwin()
is called. We need to read socket fields once, as their value
could change under us.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-09-25 13:00:36 -07:00
Eric Dumazet
b40cf18ef7 tcp: constify listener socket in tcp_v[46]_init_req()
Soon, listener socket spinlock will no longer be held,
add const arguments to tcp_v[46]_init_req() to make clear these
functions can not mess socket fields.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-09-25 13:00:36 -07:00
Yuchung Cheng
0f1c28ae74 tcp: usec resolution SYN/ACK RTT
Currently SYN/ACK RTT is measured in jiffies. For LAN the SYN/ACK
RTT is often measured as 0ms or sometimes 1ms, which would affect
RTT estimation and min RTT samping used by some congestion control.

This patch improves SYN/ACK RTT to be usec resolution if platform
supports it. While the timestamping of SYN/ACK is done in request
sock, the RTT measurement is carefully arranged to avoid storing
another u64 timestamp in tcp_sock.

For regular handshake w/o SYNACK retransmission, the RTT is sampled
right after the child socket is created and right before the request
sock is released (tcp_check_req() in tcp_minisocks.c)

For Fast Open the child socket is already created when SYN/ACK was
sent, the RTT is sampled in tcp_rcv_state_process() after processing
the final ACK an right before the request socket is released.

If the SYN/ACK was retransmistted or SYN-cookie was used, we rely
on TCP timestamps to measure the RTT. The sample is taken at the
same place in tcp_rcv_state_process() after the timestamp values
are validated in tcp_validate_incoming(). Note that we do not store
TS echo value in request_sock for SYN-cookies, because the value
is already stored in tp->rx_opt used by tcp_ack_update_rtt().

One side benefit is that the RTT measurement now happens before
initializing congestion control (of the passive side). Therefore
the congestion control can use the SYN/ACK RTT.

Signed-off-by: Yuchung Cheng <ycheng@google.com>
Signed-off-by: Neal Cardwell <ncardwell@google.com>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-09-21 16:19:01 -07:00
Daniel Borkmann
c3a8d94746 tcp: use dctcp if enabled on the route to the initiator
Currently, the following case doesn't use DCTCP, even if it should:
A responder has f.e. Cubic as system wide default, but for a specific
route to the initiating host, DCTCP is being set in RTAX_CC_ALGO. The
initiating host then uses DCTCP as congestion control, but since the
initiator sets ECT(0), tcp_ecn_create_request() doesn't set ecn_ok,
and we have to fall back to Reno after 3WHS completes.

We were thinking on how to solve this in a minimal, non-intrusive
way without bloating tcp_ecn_create_request() needlessly: lets cache
the CA ecn option flag in RTAX_FEATURES. In other words, when ECT(0)
is set on the SYN packet, set ecn_ok=1 iff route RTAX_FEATURES
contains the unexposed (internal-only) DST_FEATURE_ECN_CA. This allows
to only do a single metric feature lookup inside tcp_ecn_create_request().

Joint work with Florian Westphal.

Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-08-31 12:34:00 -07:00
Eric Dumazet
43e122b014 tcp: refine pacing rate determination
When TCP pacing was added back in linux-3.12, we chose
to apply a fixed ratio of 200 % against current rate,
to allow probing for optimal throughput even during
slow start phase, where cwnd can be doubled every other gRTT.

At Google, we found it was better applying a different ratio
while in Congestion Avoidance phase.
This ratio was set to 120 %.

We've used the normal tcp_in_slow_start() helper for a while,
then tuned the condition to select the conservative ratio
as soon as cwnd >= ssthresh/2 :

- After cwnd reduction, it is safer to ramp up more slowly,
  as we approach optimal cwnd.
- Initial ramp up (ssthresh == INFINITY) still allows doubling
  cwnd every other RTT.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Neal Cardwell <ncardwell@google.com>
Cc: Yuchung Cheng <ycheng@google.com>
Acked-by: Neal Cardwell <ncardwell@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-08-25 11:33:54 -07:00
Eric Dumazet
6f021c62d6 tcp: fix slow start after idle vs TSO/GSO
slow start after idle might reduce cwnd, but we perform this
after first packet was cooked and sent.

With TSO/GSO, it means that we might send a full TSO packet
even if cwnd should have been reduced to IW10.

Moving the SSAI check in skb_entail() makes sense, because
we slightly reduce number of times this check is done,
especially for large send() and TCP Small queue callbacks from
softirq context.

As Neal pointed out, we also need to perform the check
if/when receive window opens.

Tested:

Following packetdrill test demonstrates the problem
// Test of slow start after idle

`sysctl -q net.ipv4.tcp_slow_start_after_idle=1`

0.000 socket(..., SOCK_STREAM, IPPROTO_TCP) = 3
+0    setsockopt(3, SOL_SOCKET, SO_REUSEADDR, [1], 4) = 0
+0    bind(3, ..., ...) = 0
+0    listen(3, 1) = 0

+0    < S 0:0(0) win 65535 <mss 1000,sackOK,nop,nop,nop,wscale 7>
+0    > S. 0:0(0) ack 1 <mss 1460,nop,nop,sackOK,nop,wscale 6>
+.100 < . 1:1(0) ack 1 win 511
+0    accept(3, ..., ...) = 4
+0    setsockopt(4, SOL_SOCKET, SO_SNDBUF, [200000], 4) = 0

+0    write(4, ..., 26000) = 26000
+0    > . 1:5001(5000) ack 1
+0    > . 5001:10001(5000) ack 1
+0    %{ assert tcpi_snd_cwnd == 10 }%

+.100 < . 1:1(0) ack 10001 win 511
+0    %{ assert tcpi_snd_cwnd == 20, tcpi_snd_cwnd }%
+0    > . 10001:20001(10000) ack 1
+0    > P. 20001:26001(6000) ack 1

+.100 < . 1:1(0) ack 26001 win 511
+0    %{ assert tcpi_snd_cwnd == 36, tcpi_snd_cwnd }%

+4 write(4, ..., 20000) = 20000
// If slow start after idle works properly, we should send 5 MSS here (cwnd/2)
+0    > . 26001:31001(5000) ack 1
+0    %{ assert tcpi_snd_cwnd == 10, tcpi_snd_cwnd }%
+0    > . 31001:36001(5000) ack 1

Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Neal Cardwell <ncardwell@google.com>
Cc: Yuchung Cheng <ycheng@google.com>
Acked-by: Neal Cardwell <ncardwell@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-08-25 11:22:50 -07:00
Yuchung Cheng
76174004a0 tcp: do not slow start when cwnd equals ssthresh
In the original design slow start is only used to raise cwnd
when cwnd is stricly below ssthresh. It makes little sense
to slow start when cwnd == ssthresh: especially
when hystart has set ssthresh in the initial ramp, or after
recovery when cwnd resets to ssthresh. Not doing so will
also help reduce the buffer bloat slightly.

Signed-off-by: Yuchung Cheng <ycheng@google.com>
Signed-off-by: Neal Cardwell <ncardwell@google.com>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: Nandita Dukkipati <nanditad@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-07-09 14:22:52 -07:00
Yuchung Cheng
071d5080e3 tcp: add tcp_in_slow_start helper
Add a helper to test the slow start condition in various congestion
control modules and other places. This is to prepare a slight improvement
in policy as to exactly when to slow start.

Signed-off-by: Yuchung Cheng <ycheng@google.com>
Signed-off-by: Neal Cardwell <ncardwell@google.com>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: Nandita Dukkipati <nanditad@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-07-09 14:22:52 -07:00
Eric Dumazet
f69ad292cf tcp: fill shinfo->gso_size at last moment
In commit cd7d8498c9 ("tcp: change tcp_skb_pcount() location") we stored
gso_segs in a temporary cache hot location.

This patch does the same for gso_size.

This allows to save 2 cache line misses in tcp xmit path for
the last packet that is considered but not sent because of
various conditions (cwnd, tso defer, receiver window, TSQ...)

Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-06-11 16:33:11 -07:00
Eric Dumazet
b80c0e7858 tcp: get_cookie_sock() consolidation
IPv4 and IPv6 share same implementation of get_cookie_sock(),
and there is no point inlining it.

We add tcp_ prefix to the common helper name and export it.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-06-07 15:19:52 -07:00
Daniel Borkmann
492135557d tcp: add rfc3168, section 6.1.1.1. fallback
This work as a follow-up of commit f7b3bec6f5 ("net: allow setting ecn
via routing table") and adds RFC3168 section 6.1.1.1. fallback for outgoing
ECN connections. In other words, this work adds a retry with a non-ECN
setup SYN packet, as suggested from the RFC on the first timeout:

  [...] A host that receives no reply to an ECN-setup SYN within the
  normal SYN retransmission timeout interval MAY resend the SYN and
  any subsequent SYN retransmissions with CWR and ECE cleared. [...]

Schematic client-side view when assuming the server is in tcp_ecn=2 mode,
that is, Linux default since 2009 via commit 255cac91c3 ("tcp: extend
ECN sysctl to allow server-side only ECN"):

 1) Normal ECN-capable path:

    SYN ECE CWR ----->
                <----- SYN ACK ECE
            ACK ----->

 2) Path with broken middlebox, when client has fallback:

    SYN ECE CWR ----X crappy middlebox drops packet
                      (timeout, rtx)
            SYN ----->
                <----- SYN ACK
            ACK ----->

In case we would not have the fallback implemented, the middlebox drop
point would basically end up as:

    SYN ECE CWR ----X crappy middlebox drops packet
                      (timeout, rtx)
    SYN ECE CWR ----X crappy middlebox drops packet
                      (timeout, rtx)
    SYN ECE CWR ----X crappy middlebox drops packet
                      (timeout, rtx)

In any case, it's rather a smaller percentage of sites where there would
occur such additional setup latency: it was found in end of 2014 that ~56%
of IPv4 and 65% of IPv6 servers of Alexa 1 million list would negotiate
ECN (aka tcp_ecn=2 default), 0.42% of these webservers will fail to connect
when trying to negotiate with ECN (tcp_ecn=1) due to timeouts, which the
fallback would mitigate with a slight latency trade-off. Recent related
paper on this topic:

  Brian Trammell, Mirja Kühlewind, Damiano Boppart, Iain Learmonth,
  Gorry Fairhurst, and Richard Scheffenegger:
    "Enabling Internet-Wide Deployment of Explicit Congestion Notification."
    Proc. PAM 2015, New York.
  http://ecn.ethz.ch/ecn-pam15.pdf

Thus, when net.ipv4.tcp_ecn=1 is being set, the patch will perform RFC3168,
section 6.1.1.1. fallback on timeout. For users explicitly not wanting this
which can be in DC use case, we add a net.ipv4.tcp_ecn_fallback knob that
allows for disabling the fallback.

tp->ecn_flags are not being cleared in tcp_ecn_clear_syn() on output, but
rather we let tcp_ecn_rcv_synack() take that over on input path in case a
SYN ACK ECE was delayed. Thus a spurious SYN retransmission will not prevent
ECN being negotiated eventually in that case.

Reference: https://www.ietf.org/proceedings/92/slides/slides-92-iccrg-1.pdf
Reference: https://www.ietf.org/proceedings/89/slides/slides-89-tsvarea-1.pdf
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Mirja Kühlewind <mirja.kuehlewind@tik.ee.ethz.ch>
Signed-off-by: Brian Trammell <trammell@tik.ee.ethz.ch>
Cc: Eric Dumazet <edumazet@google.com>
Cc: Dave That <dave.taht@gmail.com>
Acked-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-05-19 16:53:37 -04:00
Eric Dumazet
b8da51ebb1 tcp: introduce tcp_under_memory_pressure()
Introduce an optimized version of sk_under_memory_pressure()
for TCP. Our intent is to use it in fast paths.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-05-17 22:45:48 -04:00
Eric Dumazet
a6c5ea4ccf tcp: rename sk_forced_wmem_schedule() to sk_forced_mem_schedule()
We plan to use sk_forced_wmem_schedule() in input path as well,
so make it non static and rename it.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-05-17 22:45:48 -04:00
Eric Dumazet
264ea103a7 tcp: syncookies: extend validity range
Now we allow storing more request socks per listener, we might
hit syncookie mode less often and hit following bug in our stack :

When we send a burst of syncookies, then exit this mode,
tcp_synq_no_recent_overflow() can return false if the ACK packets coming
from clients are coming three seconds after the end of syncookie
episode.

This is a way too strong requirement and conflicts with rest of
syncookie code which allows ACK to be aged up to 2 minutes.

Perfectly valid ACK packets are dropped just because clients might be
in a crowded wifi environment or on another planet.

So let's fix this, and also change tcp_synq_overflow() to not
dirty a cache line for every syncookie we send, as we are under attack.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Acked-by: Florian Westphal <fw@strlen.de>
Acked-by: Yuchung Cheng <ycheng@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-05-14 22:32:17 -04:00
Eric Dumazet
e520af48c7 tcp: add TCPWinProbe and TCPKeepAlive SNMP counters
Diagnosing problems related to Window Probes has been hard because
we lack a counter.

TCPWinProbe counts the number of ACK packets a sender has to send
at regular intervals to make sure a reverse ACK packet opening back
a window had not been lost.

TCPKeepAlive counts the number of ACK packets sent to keep TCP
flows alive (SO_KEEPALIVE)

Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: Yuchung Cheng <ycheng@google.com>
Acked-by: Neal Cardwell <ncardwell@google.com>
Acked-by: Nandita Dukkipati <nanditad@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-05-09 16:42:32 -04:00
Eric Dumazet
21c8fe9915 tcp: adjust window probe timers to safer values
With the advent of small rto timers in datacenter TCP,
(ip route ... rto_min x), the following can happen :

1) Qdisc is full, transmit fails.

   TCP sets a timer based on icsk_rto to retry the transmit, without
   exponential backoff.
   With low icsk_rto, and lot of sockets, all cpus are servicing timer
   interrupts like crazy.
   Intent of the code was to retry with a timer between 200 (TCP_RTO_MIN)
   and 500ms (TCP_RESOURCE_PROBE_INTERVAL)

2) Receivers can send zero windows if they don't drain their receive queue.

   TCP sends zero window probes, based on icsk_rto current value, with
   exponential backoff.
   With /proc/sys/net/ipv4/tcp_retries2 being 15 (or even smaller in
   some cases), sender can abort in less than one or two minutes !
   If receiver stops the sender, it obviously doesn't care of very tight
   rto. Probability of dropping the ACK reopening the window is not
   worth the risk.

Lets change the base timer to be at least 200ms (TCP_RTO_MIN) for these
events (but not normal RTO based retransmits)

A followup patch adds a new SNMP counter, as it would have helped a lot
diagnosing this issue.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: Yuchung Cheng <ycheng@google.com>
Acked-by: Neal Cardwell <ncardwell@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-05-09 16:42:32 -04:00
Eric Dumazet
64f40ff5bb tcp: prepare CC get_info() access from getsockopt()
We would like that optional info provided by Congestion Control
modules using netlink can also be read using getsockopt()

This patch changes get_info() to put this information in a buffer,
instead of skb, like tcp_get_info(), so that following patch
can reuse this common infrastructure.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Yuchung Cheng <ycheng@google.com>
Cc: Neal Cardwell <ncardwell@google.com>
Acked-by: Neal Cardwell <ncardwell@google.com>
Acked-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Yuchung Cheng <ycheng@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-04-29 17:10:38 -04:00
Eric Dumazet
0df48c26d8 tcp: add tcpi_bytes_acked to tcp_info
This patch tracks total number of bytes acked for a TCP socket.
This is the sum of all changes done to tp->snd_una, and allows
for precise tracking of delivered data.

RFC4898 named this : tcpEStatsAppHCThruOctetsAcked

This is a 64bit field, and can be fetched both from TCP_INFO
getsockopt() if one has a handle on a TCP socket, or from inet_diag
netlink facility (iproute2/ss patch will follow)

Note that tp->bytes_acked was placed near tp->snd_una for
best data locality and minimal performance impact.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Acked-by: Yuchung Cheng <ycheng@google.com>
Cc: Matt Mathis <mattmathis@google.com>
Cc: Eric Salo <salo@google.com>
Cc: Martin Lau <kafai@fb.com>
Cc: Chris Rapier <rapier@psc.edu>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-04-29 17:10:37 -04:00
Eric Dumazet
521f1cf1db inet_diag: fix access to tcp cc information
Two different problems are fixed here :

1) inet_sk_diag_fill() might be called without socket lock held.
   icsk->icsk_ca_ops can change under us and module be unloaded.
   -> Access to freed memory.
   Fix this using rcu_read_lock() to prevent module unload.

2) Some TCP Congestion Control modules provide information
   but again this is not safe against icsk->icsk_ca_ops
   change and nla_put() errors were ignored. Some sockets
   could not get the additional info if skb was almost full.

Fix this by returning a status from get_info() handlers and
using rcu protection as well.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Acked-by: Daniel Borkmann <daniel@iogearbox.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-04-17 13:28:31 -04:00
Daniel Lee
2646c831c0 tcp: RFC7413 option support for Fast Open client
Fast Open has been using an experimental option with a magic number
(RFC6994). This patch makes the client by default use the RFC7413
option (34) to get and send Fast Open cookies.  This patch makes
the client solicit cookies from a given server first with the
RFC7413 option. If that fails to elicit a cookie, then it tries
the RFC6994 experimental option. If that also fails, it uses the
RFC7413 option on all subsequent connect attempts.  If the server
returns a Fast Open cookie then the client caches the form of the
option that successfully elicited a cookie, and uses that form on
later connects when it presents that cookie.

The idea is to gradually obsolete the use of experimental options as
the servers and clients upgrade, while keeping the interoperability
meanwhile.

Signed-off-by: Daniel Lee <Longinus00@gmail.com>
Signed-off-by: Yuchung Cheng <ycheng@google.com>
Signed-off-by: Neal Cardwell <ncardwell@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-04-07 18:36:39 -04:00
Daniel Lee
7f9b838b71 tcp: RFC7413 option support for Fast Open server
Fast Open has been using the experimental option with a magic number
(RFC6994) to request and grant Fast Open cookies. This patch enables
the server to support the official IANA option 34 in RFC7413 in
addition.

The change has passed all existing Fast Open tests with both
old and new options at Google.

Signed-off-by: Daniel Lee <Longinus00@gmail.com>
Signed-off-by: Yuchung Cheng <ycheng@google.com>
Signed-off-by: Neal Cardwell <ncardwell@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-04-07 18:36:39 -04:00
Eric Dumazet
41d25fe092 tcp: tcp_syn_flood_action() can be static
After commit 1fb6f159fd ("tcp: add tcp_conn_request"),
tcp_syn_flood_action() is no longer used from IPv6.

We can make it static, by moving it above tcp_conn_request()

Signed-off-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: Octavian Purdila <octavian.purdila@intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-03-29 12:17:18 -07:00
Eric Dumazet
fd3a154a00 tcp: md5: get rid of tcp_v[46]_reqsk_md5_lookup()
With request socks convergence, we no longer need
different lookup methods. A request socket can
use generic lookup function.

Add const qualifier to 2nd tcp_v[46]_md5_lookup() parameter.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-03-24 21:16:30 -04:00
Eric Dumazet
39f8e58e53 tcp: md5: remove request sock argument of calc_md5_hash()
Since request and established sockets now have same base,
there is no need to pass two pointers to tcp_v4_md5_hash_skb()
or tcp_v6_md5_hash_skb()

Also add a const qualifier to their struct tcp_md5sig_key argument.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-03-24 21:16:30 -04:00
Eric Dumazet
26e3736090 ipv4: tcp: handle ICMP messages on TCP_NEW_SYN_RECV request sockets
tcp_v4_err() can restrict lookups to ehash table, and not to listeners.

Note this patch creates the infrastructure, but this means that ICMP
messages for request sockets are ignored until complete conversion.

New tcp_req_err() helper is exported so that we can use it in IPv6
in following patch.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-03-23 16:52:26 -04:00
Eric Dumazet
42cb80a235 inet: remove sk_listener parameter from syn_ack_timeout()
It is not needed, and req->sk_listener points to the listener anyway.
request_sock argument can be const.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-03-23 16:52:25 -04:00