Commit graph

4139 commits

Author SHA1 Message Date
Ravinder Konka
3319385988 net: core: Send only BIND and LISTEN events.
Currently DPM listens for all netlink events,
which causes battery drain. Since DPM is
interested only in BIND and LISTEN events,
process only BIND and LISTEN events
and skip the rest for the sockev client.

CRs-Fixed: 945890
Change-Id: Iae11027945b981538f9c16ae9d5cd1ecf88a3743
Signed-off-by: Susheel Yadagiri <syadagir@codeaurora.org>
Signed-off-by: Ravinder Konka <rkonka@codeaurora.org>
2016-03-23 21:19:16 -07:00
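
The filtering described in commit 3319385988 above can be pictured with a minimal userspace sketch. The enum and the function name below are hypothetical, chosen only for illustration; the actual sockev identifiers are not shown in the log and may differ.

```c
#include <stdbool.h>

/* Hypothetical event codes; the real sockev identifiers may differ. */
enum sock_event {
    EV_SOCKET,
    EV_BIND,
    EV_LISTEN,
    EV_ACCEPT,
    EV_CONNECT,
    EV_SHUTDOWN
};

/* Return true only for the two events the client is said to need. */
static bool sockev_should_forward(enum sock_event ev)
{
    switch (ev) {
    case EV_BIND:
    case EV_LISTEN:
        return true;
    default:
        return false;   /* every other event is skipped */
    }
}
```

Dropping uninteresting events before a message is built is presumably what saves the work (and battery) the commit message mentions.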
Ravinder Konka
57483d074c net: Reset NAPI bit if IPI failed
During hotplug, if an RPS CPU goes offline,
there is a possibility that the IPI
delivery to the RPS core might fail. This
happens when unruly drivers
use the netif_rx API in the wrong context.

This happens for two reasons:

a) Using the netif_rx API in a non-preemptive
context leads to enough latencies that the IPI
delivery might fail to an RPS core. This is because
the softIRQ trigger will become unpredictable.

b) Using netif_rx becomes an architectural
issue where we are trying to do two things in two
different contexts. We set the NAPI bit in one context
and send the IPI in another. Since a context switch
is allowed in between, the remote CPU is allowed
to go finish its hotplug.

If there was no context switch in the first place,
which typically happens by either using the correct
version of netif_rx or switching to NAPI framework,
then the remote CPU is not allowed to go to CPU DOWN
state. This is by design, since the hotplug framework causes
the remote dying CPU to wait until at least one context
switch happens on all other CPUs. If preemption is
disabled then the dying CPU has to wait until preemption
is enabled and a context switch happens.

This patch catches these unruly drivers and handles
IPI misses by clearing the NAPI state on remote RPS CPUs.

For more documentation on hotplug and preemption
cases, see https://lwn.net/Articles/569686/

CRs-Fixed: 966095
Change-Id: I072f91bdb4d7e444e3624e8e010ef1b66a67b1ed
Acked-by: Abhishek Chauhan <abchauha@qti.qualcomm.com>
Signed-off-by: Ravinder Konka <rkonka@codeaurora.org>
2016-03-23 21:18:59 -07:00
Jeevan Shriram
105850528e net: core: fix compilation warning for uninitialized variable
It is possible that the 'tail' variable is used without initialization.
This change fixes uninitialized variable usage.

Change-Id: Idbd7d52797af2eeffcece19249055d5099a7fdb1
Signed-off-by: Jeevan Shriram <jshriram@codeaurora.org>
2016-03-23 21:12:08 -07:00
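
As an illustration of the class of warning fixed in commit 105850528e above, here is a small standalone sketch. The list type and the function are hypothetical and are not the kernel code touched by the patch; only the variable name 'tail' is borrowed from the commit message.

```c
#include <stddef.h>

/* Hypothetical singly linked list node, for illustration only. */
struct node {
    int value;
    struct node *next;
};

/* Return the last node of the list, or NULL for an empty list. */
static struct node *list_tail(struct node *head)
{
    /* Initialized up front: if the loop body never runs, the
     * function still returns a well-defined value. */
    struct node *tail = NULL;

    for (struct node *n = head; n != NULL; n = n->next)
        tail = n;

    return tail;
}
```

Without the initializer, a compiler can warn that 'tail' may be used uninitialized, because the loop might execute zero times.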
David Keitel
f2b1fed1bd Merge remote-tracking branch 'lsk-44/linux-linaro-lsk-v4.4' into 44rc2
* lsk-44/linux-linaro-lsk-v4.4:
  Linux 4.4.3
  modules: fix modparam async_probe request
  module: wrapper for symbol name.
  itimers: Handle relative timers with CONFIG_TIME_LOW_RES proper
  posix-timers: Handle relative timers with CONFIG_TIME_LOW_RES proper
  timerfd: Handle relative timers with CONFIG_TIME_LOW_RES proper
  prctl: take mmap sem for writing to protect against others
  xfs: log mount failures don't wait for buffers to be released
  Revert "xfs: clear PF_NOFREEZE for xfsaild kthread"
  xfs: inode recovery readahead can race with inode buffer creation
  libxfs: pack the agfl header structure so XFS_AGFL_SIZE is correct
  ovl: setattr: check permissions before copy-up
  ovl: root: copy attr
  ovl: check dentry positiveness in ovl_cleanup_whiteouts()
  ovl: use a minimal buffer in ovl_copy_xattr
  ovl: allow zero size xattr
  futex: Drop refcount if requeue_pi() acquired the rtmutex
  devm_memremap_release(): fix memremap'd addr handling
  ipc/shm: handle removed segments gracefully in shm_mmap()
  intel_scu_ipcutil: underflow in scu_reg_access()
  mm,thp: khugepaged: call pte flush at the time of collapse
  dump_stack: avoid potential deadlocks
  radix-tree: fix oops after radix_tree_iter_retry
  drivers/hwspinlock: fix race between radix tree insertion and lookup
  radix-tree: fix race in gang lookup
  MAINTAINERS: return arch/sh to maintained state, with new maintainers
  memcg: only free spare array when readers are done
  numa: fix /proc/<pid>/numa_maps for hugetlbfs on s390
  fs/hugetlbfs/inode.c: fix bugs in hugetlb_vmtruncate_list()
  scripts/bloat-o-meter: fix python3 syntax error
  dma-debug: switch check from _text to _stext
  m32r: fix m32104ut_defconfig build fail
  xhci: Fix list corruption in urb dequeue at host removal
  Revert "xhci: don't finish a TD if we get a short-transfer event mid TD"
  iommu/vt-d: Clear PPR bit to ensure we get more page request interrupts
  iommu/vt-d: Fix 64-bit accesses to 32-bit DMAR_GSTS_REG
  iommu/vt-d: Fix mm refcounting to hold mm_count not mm_users
  iommu/amd: Correct the wrong setting of alias DTE in do_attach
  iommu/vt-d: Don't skip PCI devices when disabling IOTLB
  Input: vmmouse - fix absolute device registration
  string_helpers: fix precision loss for some inputs
  Input: i8042 - add Fujitsu Lifebook U745 to the nomux list
  Input: elantech - mark protocols v2 and v3 as semi-mt
  mm: fix regression in remap_file_pages() emulation
  mm: replace vma_lock_anon_vma with anon_vma_lock_read/write
  mm: fix mlock accouting
  libnvdimm: fix namespace object confusion in is_uuid_busy()
  mm: soft-offline: check return value in second __get_any_page() call
  perf kvm record/report: 'unprocessable sample' error while recording/reporting guest data
  KVM: PPC: Fix ONE_REG AltiVec support
  KVM: PPC: Fix emulation of H_SET_DABR/X on POWER8
  KVM: arm/arm64: Fix reference to uninitialised VGIC
  arm64: dma-mapping: fix handling of devices registered before arch_initcall
  ARM: OMAP2+: Fix ppa_zero_params and ppa_por_params for rodata
  ARM: OMAP2+: Fix save_secure_ram_context for rodata
  ARM: OMAP2+: Fix l2dis_3630 for rodata
  ARM: OMAP2+: Fix l2_inv_api_params for rodata
  ARM: OMAP2+: Fix wait_dll_lock_timed for rodata
  ARM: dts: at91: sama5d4ek: add phy address and IRQ for macb0
  ARM: dts: at91: sama5d4 xplained: fix phy0 IRQ type
  ARM: dts: at91: sama5d4: fix instance id of DBGU
  ARM: dts: at91: sama5d4 xplained: properly mux phy interrupt
  ARM: dts: omap5-board-common: enable rtc and charging of backup battery
  ARM: dts: Fix omap5 PMIC control lines for RTC writes
  ARM: dts: Fix wl12xx missing clocks that cause hangs
  ARM: nomadik: fix up SD/MMC DT settings
  ARM: 8517/1: ICST: avoid arithmetic overflow in icst_hz()
  ARM: 8519/1: ICST: try other dividends than 1
  arm64: mm: avoid calling apply_to_page_range on empty range
  ARM: mvebu: remove duplicated regulator definition in Armada 388 GP
  powerpc/ioda: Set "read" permission when "write" is set
  powerpc/powernv: Fix stale PE primary bus
  powerpc/eeh: Fix stale cached primary bus
  powerpc/eeh: Fix PE location code
  SUNRPC: Fixup socket wait for memory
  udf: Check output buffer length when converting name to CS0
  udf: Prevent buffer overrun with multi-byte characters
  udf: limit the maximum number of indirect extents in a row
  pNFS/flexfiles: Fix an XDR encoding bug in layoutreturn
  nfs: Fix race in __update_open_stateid()
  pNFS/flexfiles: Fix an Oopsable typo in ff_mirror_match_fh()
  NFS: Fix attribute cache revalidation
  cifs: fix erroneous return value
  cifs_dbg() outputs an uninitialized buffer in cifs_readdir()
  cifs: fix race between call_async() and reconnect()
  cifs: Ratelimit kernel log messages
  iio: inkern: fix a NULL dereference on error
  iio: pressure: mpl115: fix temperature offset sign
  iio: light: acpi-als: Report data as processed
  iio: dac: mcp4725: set iio name property in sysfs
  iio: add IIO_TRIGGER dependency to STK8BA50
  iio: add HAS_IOMEM dependency to VF610_ADC
  iio-light: Use a signed return type for ltr501_match_samp_freq()
  iio:adc:ti_am335x_adc Fix buffered mode by identifying as software buffer.
  iio: adis_buffer: Fix out-of-bounds memory access
  scsi: fix soft lockup in scsi_remove_target() on module removal
  SCSI: Add Marvell Console to VPD blacklist
  scsi_dh_rdac: always retry MODE SELECT on command lock violation
  drivers/scsi/sg.c: mark VMA as VM_IO to prevent migration
  SCSI: fix crashes in sd and sr runtime PM
  iscsi-target: Fix potential dead-lock during node acl delete
  scsi: add Synology to 1024 sector blacklist
  klist: fix starting point removed bug in klist iterators
  tracepoints: Do not trace when cpu is offline
  tracing: Fix freak link error caused by branch tracer
  perf tools: tracepoint_error() can receive e=NULL, robustify it
  tools lib traceevent: Fix output of %llu for 64 bit values read on 32 bit machines
  ptrace: use fsuid, fsgid, effective creds for fs access checks
  Btrfs: fix direct IO requests not reporting IO error to user space
  Btrfs: fix hang on extent buffer lock caused by the inode_paths ioctl
  Btrfs: fix page reading in extent_same ioctl leading to csum errors
  Btrfs: fix invalid page accesses in extent_same (dedup) ioctl
  btrfs: properly set the termination value of ctx->pos in readdir
  Revert "btrfs: clear PF_NOFREEZE in cleaner_kthread()"
  Btrfs: fix fitrim discarding device area reserved for boot loader's use
  btrfs: handle invalid num_stripes in sys_array
  ext4: don't read blocks from disk after extents being swapped
  ext4: fix potential integer overflow
  ext4: fix scheduling in atomic on group checksum failure
  serial: omap: Prevent DoS using unprivileged ioctl(TIOCSRS485)
  serial: 8250_pci: Add Intel Broadwell ports
  tty: Add support for PCIe WCH382 2S multi-IO card
  pty: make sure super_block is still valid in final /dev/tty close
  pty: fix possible use after free of tty->driver_data
  staging/speakup: Use tty_ldisc_ref() for paste kworker
  phy: twl4030-usb: Fix unbalanced pm_runtime_enable on module reload
  phy: twl4030-usb: Relase usb phy on unload
  ALSA: seq: Fix double port list deletion
  ALSA: seq: Fix leak of pool buffer at concurrent writes
  ALSA: pcm: Fix rwsem deadlock for non-atomic PCM stream
  ALSA: hda - Cancel probe work instead of flush at remove
  x86/mm: Fix vmalloc_fault() to handle large pages properly
  x86/uaccess/64: Handle the caching of 4-byte nocache copies properly in __copy_user_nocache()
  x86/uaccess/64: Make the __copy_user_nocache() assembly code more readable
  x86/mm/pat: Avoid truncation when converting cpa->numpages to address
  x86/mm: Fix types used in pgprot cacheability flags translations
  Linux 4.4.2
  HID: multitouch: fix input mode switching on some Elan panels
  mm, vmstat: fix wrong WQ sleep when memory reclaim doesn't make any progress
  zsmalloc: fix migrate_zspage-zs_free race condition
  zram: don't call idr_remove() from zram_remove()
  zram: try vmalloc() after kmalloc()
  zram/zcomp: use GFP_NOIO to allocate streams
  rtlwifi: rtl8821ae: Fix 5G failure when EEPROM is incorrectly encoded
  rtlwifi: rtl8821ae: Fix errors in parameter initialization
  crypto: marvell/cesa - fix test in mv_cesa_dev_dma_init()
  crypto: atmel-sha - remove calls of clk_prepare() from atomic contexts
  crypto: atmel-sha - fix atmel_sha_remove()
  crypto: algif_skcipher - Do not set MAY_BACKLOG on the async path
  crypto: algif_skcipher - Do not dereference ctx without socket lock
  crypto: algif_skcipher - Do not assume that req is unchanged
  crypto: user - lock crypto_alg_list on alg dump
  EVM: Use crypto_memneq() for digest comparisons
  crypto: algif_hash - wait for crypto_ahash_init() to complete
  crypto: shash - Fix has_key setting
  crypto: chacha20-ssse3 - Align stack pointer to 64 bytes
  crypto: caam - make write transactions bufferable on PPC platforms
  crypto: algif_skcipher - sendmsg SG marking is off by one
  crypto: algif_skcipher - Load TX SG list after waiting
  crypto: crc32c - Fix crc32c soft dependency
  crypto: algif_skcipher - Fix race condition in skcipher_check_key
  crypto: algif_hash - Fix race condition in hash_check_key
  crypto: af_alg - Forbid bind(2) when nokey child sockets are present
  crypto: algif_skcipher - Remove custom release parent function
  crypto: algif_hash - Remove custom release parent function
  crypto: af_alg - Allow af_af_alg_release_parent to be called on nokey path
  ahci: Intel DNV device IDs SATA
  libata: disable forced PORTS_IMPL for >= AHCI 1.3
  crypto: algif_skcipher - Add key check exception for cipher_null
  crypto: skcipher - Add crypto_skcipher_has_setkey
  crypto: algif_hash - Require setkey before accept(2)
  crypto: hash - Add crypto_ahash_has_setkey
  crypto: algif_skcipher - Add nokey compatibility path
  crypto: af_alg - Add nokey compatibility path
  crypto: af_alg - Fix socket double-free when accept fails
  crypto: af_alg - Disallow bind/setkey/... after accept(2)
  crypto: algif_skcipher - Require setkey before accept(2)
  sched: Fix crash in sched_init_numa()
  ext4 crypto: add missing locking for keyring_key access
  iommu/io-pgtable-arm: Ensure we free the final level on teardown
  tty: Fix unsafe ldisc reference via ioctl(TIOCGETD)
  tty: Retry failed reopen if tty teardown in-progress
  tty: Wait interruptibly for tty lock on reopen
  n_tty: Fix unsafe reference to "other" ldisc
  usb: xhci: apply XHCI_PME_STUCK_QUIRK to Intel Broxton-M platforms
  usb: xhci: handle both SSIC ports in PME stuck quirk
  usb: phy: msm: fix error handling in probe.
  usb: cdc-acm: send zero packet for intel 7260 modem
  usb: cdc-acm: handle unlinked urb in acm read callback
  USB: option: fix Cinterion AHxx enumeration
  USB: serial: option: Adding support for Telit LE922
  USB: cp210x: add ID for IAI USB to RS485 adaptor
  USB: serial: ftdi_sio: add support for Yaesu SCU-18 cable
  usb: hub: do not clear BOS field during reset device
  USB: visor: fix null-deref at probe
  USB: serial: visor: fix crash on detecting device without write_urbs
  ASoC: rt5645: fix the shift bit of IN1 boost
  saa7134-alsa: Only frees registered sound cards
  ALSA: dummy: Implement timer backend switching more safely
  ALSA: hda - Fix bad dereference of jack object
  ALSA: hda - Fix speaker output from VAIO AiO machines
  Revert "ALSA: hda - Fix noise on Gigabyte Z170X mobo"
  ALSA: hda - Fix static checker warning in patch_hdmi.c
  ALSA: hda - Add fixup for Mac Mini 7,1 model
  ALSA: timer: Fix race between stop and interrupt
  ALSA: timer: Fix wrong instance passed to slave callbacks
  ALSA: timer: Fix race at concurrent reads
  ALSA: timer: Fix link corruption due to double start or stop
  ALSA: timer: Fix leftover link at closing
  ALSA: timer: Code cleanup
  ALSA: seq: Fix lockdep warnings due to double mutex locks
  ALSA: seq: Fix race at closing in virmidi driver
  ALSA: seq: Fix yet another races among ALSA timer accesses
  ASoC: dpcm: fix the BE state on hw_free
  ALSA: pcm: Fix potential deadlock in OSS emulation
  ALSA: hda/realtek - Support Dell headset mode for ALC225
  ALSA: hda/realtek - Support headset mode for ALC225
  ALSA: hda/realtek - New codec support of ALC225
  ALSA: rawmidi: Fix race at copying & updating the position
  ALSA: rawmidi: Remove kernel WARNING for NULL user-space buffer check
  ALSA: rawmidi: Make snd_rawmidi_transmit() race-free
  ALSA: seq: Degrade the error message for too many opens
  ALSA: seq: Fix incorrect sanity check at snd_seq_oss_synth_cleanup()
  ALSA: dummy: Disable switching timer backend via sysfs
  ALSA: compress: Disable GET_CODEC_CAPS ioctl for some architectures
  ALSA: hda - disable dynamic clock gating on Broxton before reset
  ALSA: Add missing dependency on CONFIG_SND_TIMER
  ALSA: bebob: Use a signed return type for get_formation_index
  ALSA: usb-audio: avoid freeing umidi object twice
  ALSA: usb-audio: Add native DSD support for PS Audio NuWave DAC
  ALSA: usb-audio: Fix OPPO HA-1 vendor ID
  ALSA: usb-audio: Add quirk for Microsoft LifeCam HD-6000
  ALSA: usb-audio: Fix TEAC UD-501/UD-503/NT-503 usb delay
  hrtimer: Handle remaining time proper for TIME_LOW_RES
  md/raid: only permit hot-add of compatible integrity profiles
  media: i2c: Don't export ir-kbd-i2c module alias
  parisc: Fix __ARCH_SI_PREAMBLE_SIZE
  parisc: Protect huge page pte changes with spinlocks
  printk: do cond_resched() between lines while outputting to consoles
  tracing/stacktrace: Show entire trace if passed in function not found
  tracing: Fix stacktrace skip depth in trace_buffer_unlock_commit_regs()
  PCI: Fix minimum allocation address overwrite
  PCI: host: Mark PCIe/PCI (MSI) IRQ cascade handlers as IRQF_NO_THREAD
  mtd: nand: assign reasonable default name for NAND drivers
  wlcore/wl12xx: spi: fix NULL pointer dereference (Oops)
  wlcore/wl12xx: spi: fix oops on firmware load
  ocfs2/dlm: clear refmap bit of recovery lock while doing local recovery cleanup
  ocfs2/dlm: ignore cleaning the migration mle that is inuse
  ALSA: hda - Implement loopback control switch for Realtek and other codecs
  block: fix bio splitting on max sectors
  base/platform: Fix platform drivers with no probe callback
  HID: usbhid: fix recursive deadlock
  ocfs2: NFS hangs in __ocfs2_cluster_lock due to race with ocfs2_unblock_lock
  block: split bios to max possible length
  NFSv4.1/pnfs: Fixup an lo->plh_block_lgets imbalance in layoutreturn
  crypto: sun4i-ss - add missing statesize
  Linux 4.4.1
  arm64: kernel: fix architected PMU registers unconditional access
  arm64: kernel: enforce pmuserenr_el0 initialization and restore
  arm64: mm: ensure that the zero page is visible to the page table walker
  arm64: Clear out any singlestep state on a ptrace detach operation
  powerpc/module: Handle R_PPC64_ENTRY relocations
  scripts/recordmcount.pl: support data in text section on powerpc
  powerpc: Make {cmp}xchg* and their atomic_ versions fully ordered
  powerpc: Make value-returning atomics fully ordered
  powerpc/tm: Check for already reclaimed tasks
  batman-adv: Drop immediate orig_node free function
  batman-adv: Drop immediate batadv_hard_iface free function
  batman-adv: Drop immediate neigh_ifinfo free function
  batman-adv: Drop immediate batadv_neigh_node free function
  batman-adv: Drop immediate batadv_orig_ifinfo free function
  batman-adv: Avoid recursive call_rcu for batadv_nc_node
  batman-adv: Avoid recursive call_rcu for batadv_bla_claim
  team: Replace rcu_read_lock with a mutex in team_vlan_rx_kill_vid
  net/mlx5_core: Fix trimming down IRQ number
  bridge: fix lockdep addr_list_lock false positive splat
  ipv6: update skb->csum when CE mark is propagated
  net: bpf: reject invalid shifts
  phonet: properly unshare skbs in phonet_rcv()
  dwc_eth_qos: Fix dma address for multi-fragment skbs
  bonding: Prevent IPv6 link local address on enslaved devices
  net: preserve IP control block during GSO segmentation
  udp: disallow UFO for sockets with SO_NO_CHECK option
  net: pktgen: fix null ptr deref in skb allocation
  sched,cls_flower: set key address type when present
  tcp_yeah: don't set ssthresh below 2
  ipv6: tcp: add rcu locking in tcp_v6_send_synack()
  net: sctp: prevent writes to cookie_hmac_alg from accessing invalid memory
  vxlan: fix test which detect duplicate vxlan iface
  unix: properly account for FDs passed over unix sockets
  xhci: refuse loading if nousb is used
  usb: core: lpm: fix usb3_hardware_lpm sysfs node
  USB: cp210x: add ID for ELV Marble Sound Board 1
  rtlwifi: fix memory leak for USB device
  ASoC: compress: Fix compress device direction check
  ASoC: wm5110: Fix PGA clear when disabling DRE
  ALSA: timer: Handle disconnection more safely
  ALSA: hda - Flush the pending probe work at remove
  ALSA: hda - Fix missing module loading with model=generic option
  ALSA: hda - Fix bass pin fixup for ASUS N550JX
  ALSA: control: Avoid kernel warnings from tlv ioctl with numid 0
  ALSA: hrtimer: Fix stall by hrtimer_cancel()
  ALSA: pcm: Fix snd_pcm_hw_params struct copy in compat mode
  ALSA: seq: Fix snd_seq_call_port_info_ioctl in compat mode
  ALSA: hda - Add fixup for Dell Latitidue E6540
  ALSA: timer: Fix double unlink of active_list
  ALSA: timer: Fix race among timer ioctls
  ALSA: hda - fix the headset mic detection problem for a Dell laptop
  ALSA: timer: Harden slave timer list handling
  ALSA: usb-audio: Fix mixer ctl regression of Native Instrument devices
  ALSA: hda - Fix white noise on Dell Latitude E5550
  ALSA: seq: Fix race at timer setup and close
  ALSA: usb-audio: Avoid calling usb_autopm_put_interface() at disconnect
  ALSA: seq: Fix missing NULL check at remove_events ioctl
  ALSA: hda - Fixup inverted internal mic for Lenovo E50-80
  ALSA: usb: Add native DSD support for Oppo HA-1
  x86/mm: Improve switch_mm() barrier comments
  x86/mm: Add barriers and document switch_mm()-vs-flush synchronization
  x86/boot: Double BOOT_HEAP_SIZE to 64KB
  x86/reboot/quirks: Add iMac10,1 to pci_reboot_dmi_table[]
  kvm: x86: Fix vmwrite to SECONDARY_VM_EXEC_CONTROL
  KVM: x86: correctly print #AC in traces
  KVM: x86: expose MSR_TSC_AUX to userspace
  x86/xen: don't reset vcpu_info on a cancelled suspend
  KEYS: Fix keyring ref leak in join_session_keyring()

Conflicts:
	arch/arm64/kernel/perf_event.c
	drivers/scsi/sd.c
	sound/core/compress_offload.c

Change-Id: I9f77fe42aaae249c24cd6e170202110ab1426878
Signed-off-by: Trilok Soni <tsoni@codeaurora.org>
2016-03-23 20:51:00 -07:00
Ravinder Konka
c60a91f21f skb: Adding trace event for gso.
This patch adds trace events to help debug the GSO feature
by identifying the packets (and their lengths) that use
the segmentation offload feature.

Change-Id: Ibfe1194cc63e74c75047040b0c540713d539992e
Acked-by: Ashwanth Goli <ashwanth@qti.qualcomm.com>
Signed-off-by: Ravinder Konka <rkonka@codeaurora.org>
2016-03-23 20:09:48 -07:00
Harout Hedeshian
e9a33bcd2f net: add a per-cpu counter for the number of frames coalesced in GRO
A low cost method of determining GRO statistics is required. This
change introduces a new counter which tracks whenever GRO coalesces
ingress packets. The counter is per-CPU and exposed in
/proc/net/softnet_stat as the last column of data. No user space
impact is expected as a result of this change. However, this change
should be reverted if legacy tools have problems with the new column
in softnet_stat.

Change-Id: I05965c0cb150947935d5977884cc4d583b37131d
Signed-off-by: Harout Hedeshian <harouth@codeaurora.org>
[subashab@codeaurora.org: resolve trivial merge conflicts]
Signed-off-by: Subash Abhinov Kasiviswanathan <subashab@codeaurora.org>
2016-03-22 11:09:45 -07:00
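
Commit e9a33bcd2f above says the new counter is exposed as the last column of /proc/net/softnet_stat. A minimal sketch of reading that column from one line follows; it assumes the usual layout of whitespace-separated hexadecimal fields, and the helper name is hypothetical.

```c
#include <ctype.h>
#include <stdlib.h>
#include <string.h>

/* Parse the last whitespace-separated hexadecimal field of one line.
 * Returns 0 and stores the value in *out on success, -1 otherwise. */
static int last_hex_column(const char *line, unsigned long *out)
{
    const char *end = line + strlen(line);
    const char *start;
    char *stop;
    unsigned long value;

    /* Trim trailing whitespace, including the newline. */
    while (end > line && isspace((unsigned char)end[-1]))
        end--;

    /* Walk back to the beginning of the last field. */
    start = end;
    while (start > line && !isspace((unsigned char)start[-1]))
        start--;

    if (start == end)
        return -1;              /* empty or blank line */

    value = strtoul(start, &stop, 16);
    if (stop != end)
        return -1;              /* field is not purely hexadecimal */

    *out = value;
    return 0;
}
```

A caller would typically read the file line by line (for example with fgets) and apply the helper to each line. Tools that assume a fixed column count are the "legacy tools" risk the commit message mentions.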
Harout Hedeshian
a511f282c9 net: sockev: corrected sk_family filter logic
Commit 43e0e31e2d6e ("net: sockev: filtering non INET socket events")
from Krishnan introduced incorrect conditional logic which caused
the socket address families to be incorrectly filtered. This
patch corrects the logic.

CRs-Fixed: 830947
Cc: Krishnan Ramachandran <kramacha@qti.qualcomm.com>
Acked-by: Devesh Bisht <dbisht@qti.qualcomm.com>
Change-Id: I40a001a69d5aab25f7f97a7378aceae301fd762a
Signed-off-by: Harout Hedeshian <harouth@codeaurora.org>
2016-03-22 11:09:43 -07:00
Subash Abhinov Kasiviswanathan
e1f88edd76 net: Warn for cloned packets in ingress path
Cloned packets arriving in ingress path can cause issues with GRO
since the skb shared info is garbled.

Warn once if a cloned packet is queued up to the network stack.

CRs-Fixed: 823275
Change-Id: I049f04f39b3d1338838560e08c93a973de427fc0
Signed-off-by: Subash Abhinov Kasiviswanathan <subashab@codeaurora.org>
[subashab@codeaurora.org: resolve trivial merge conflicts]
Signed-off-by: Subash Abhinov Kasiviswanathan <subashab@codeaurora.org>
2016-03-22 11:09:42 -07:00
Harout Hedeshian
bf669ab2dd net: sockev: filtering non INET socket events
Too many socket events are generated by the netlink socket. Filter out
unwanted socket events.

Change-Id: I3a4d7e14843cf72d6af2d948113b27928ed01adb
Acked-by: Krishnan Ramachandran <kramacha@qti.qualcomm.com>
Signed-off-by: Harout Hedeshian <harouth@codeaurora.org>
2016-03-22 11:09:40 -07:00
Harout Hedeshian
274f3cfdd0 net: sockev: Initial Commit
Added module which subscribes to socket notifier events. Notifier events
are then converted to a multicast netlink message for user space
applications to consume.

CRs-Fixed:  626021
Change-Id: Id5c6808d972b69f5f065d7fba9094e75c6ad0b2c
Signed-off-by: Harout Hedeshian <harouth@codeaurora.org>
[subashab@codeaurora.org: resolve trivial merge conflicts]
Signed-off-by: Subash Abhinov Kasiviswanathan <subashab@codeaurora.org>
2016-03-22 11:09:38 -07:00
Harout Hedeshian
1afe1dec58 net: core: add MAP support to RPS flow dissector
More efficiently utilize multiple cores when using MAP encoded frames.
RPS flow dissector now appropriately decodes IPv4 and IPv6 frames with
MAP headers prepended.

CRs-Fixed: 681280
Change-Id: Ia4dde47fcc0f3436dcaa71a5160c0a59fe7ed53a
Signed-off-by: Harout Hedeshian <harouth@codeaurora.org>
[subashab@codeaurora.org: resolve trivial merge conflicts]
Signed-off-by: Subash Abhinov Kasiviswanathan <subashab@codeaurora.org>
2016-03-22 11:09:37 -07:00
Subash Abhinov Kasiviswanathan
0dc9f20694 net: Change the current NAPI context to use latest API
Commit 47405a253d ("percpu: Remove __this_cpu_ptr") and commit
6c51ec4d18 ("percpu: remove __get_cpu_var and __raw_get_cpu_var
macros") removed __this_cpu_ptr which is needed to access current
NAPI. Use this_cpu_ptr instead.

Signed-off-by: Subash Abhinov Kasiviswanathan <subashab@codeaurora.org>
2016-03-22 11:05:57 -07:00
Subash Abhinov Kasiviswanathan
09a36c1368 net: Add the get current NAPI context API
Commit 69235aa80090 ("net: Remove the get current NAPI context API")
removed the definition of get_current_napi_context() as rmnet_data
was no longer using it. However, the rmnet_data change to use its
NAPI in multiple contexts was prone to race in hotplug scenarios.

Add back get_current_napi_context() and current_napi to the
softnet_data struct.

CRs-Fixed: 966095
Change-Id: I7cf1c5e39a5ccbd7a74a096b11efd179a4d0d034
Signed-off-by: Subash Abhinov Kasiviswanathan <subashab@codeaurora.org>
[subashab@codeaurora.org: resolve trivial merge conflicts]
Signed-off-by: Subash Abhinov Kasiviswanathan <subashab@codeaurora.org>
2016-03-22 11:05:56 -07:00
Lorenzo Colitti
f340b7c9ec net: diag: Add the ability to destroy a socket.
This patch adds a SOCK_DESTROY operation, a destroy function
pointer to sock_diag_handler, and a diag_destroy function
pointer.  It does not include any implementation code.

[backport of net-next 64be0aed59ad519d6f2160868734f7e278290ac1]

Change-Id: Ic5327ff14b39dd268083ee4c1dc2c934b2820df5
Signed-off-by: Lorenzo Colitti <lorenzo@google.com>
Acked-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-02-25 09:01:21 +09:00
Lorenzo Colitti
fd2cf795f3 net: core: Support UID-based routing.
This contains the following commits:

1. cc2f522 net: core: Add a UID range to fib rules.
2. d7ed2bd net: core: Use the socket UID in routing lookups.
3. 2f9306a net: core: Add a RTA_UID attribute to routes.
    This is so that userspace can do per-UID route lookups.
4. 8e46efb net: ipv6: Use the UID in IPv6 PMTUD
    IPv4 PMTUD already does this because ipv4_sk_update_pmtu
    uses __build_flow_key, which includes the UID.

Bug: 15413527
Change-Id: Iae3d4ca3979d252b6cec989bdc1a6875f811f03a
Signed-off-by: Lorenzo Colitti <lorenzo@google.com>
2016-02-16 13:51:37 -08:00
Rabin Vincent
35987ff2ea net: bpf: reject invalid shifts
[ Upstream commit 229394e8e62a4191d592842cf67e80c62a492937 ]

On ARM64, a BUG() is triggered in the eBPF JIT if a filter with a
constant shift that can't be encoded in the immediate field of the
UBFM/SBFM instructions is passed to the JIT.  Since these shift
amounts, which are negative or >= regsize, are invalid, reject them in
the eBPF verifier and the classic BPF filter checker, for all
architectures.

Signed-off-by: Rabin Vincent <rabin@rab.in>
Acked-by: Alexei Starovoitov <ast@kernel.org>
Acked-by: Daniel Borkmann <daniel@iogearbox.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2016-01-31 11:29:01 -08:00
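
The condition stated in commit 35987ff2ea above (a constant shift is invalid when it is negative or >= regsize) can be written as a one-line predicate. This is only a sketch of the rule as the message words it; the function name is hypothetical and this is not the verifier code itself.

```c
#include <stdbool.h>

/* A constant shift amount is acceptable only if it lies in
 * [0, regsize). regsize is the register width in bits (32 or 64). */
static bool shift_imm_valid(int imm, int regsize)
{
    return imm >= 0 && imm < regsize;
}
```

Rejecting such filters when they are loaded means no JIT backend ever has to encode an out-of-range immediate.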
Konstantin Khlebnikov
1f0bdf6092 net: preserve IP control block during GSO segmentation
[ Upstream commit 9207f9d45b0ad071baa128e846d7e7ed85016df3 ]

skb_gso_segment() uses the skb control block during segmentation.
This patch adds 32 bytes of room for the previous control block, which
will be copied into all resulting segments.

This patch fixes a kernel crash when fragmenting forwarded packets.
Fragmentation requires a valid IP CB in the skb for clearing IP options.
The patch also removes the custom save/restore in the ovs code, which is now redundant.

Signed-off-by: Konstantin Khlebnikov <koct9i@gmail.com>
Link: http://lkml.kernel.org/r/CALYGNiP-0MZ-FExV2HutTvE9U-QQtkKSoE--KN=JQE5STYsjAA@mail.gmail.com
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2016-01-31 11:29:00 -08:00
John Fastabend
4d90b0d1c1 net: pktgen: fix null ptr deref in skb allocation
[ Upstream commit 3de03596dfeee48bc803c1d1a6daf60a459929f3 ]

Fix possible null pointer dereference that may occur when calling
skb_reserve() on a null skb.

Fixes: 879c7220e8 ("net: pktgen: Observe needed_headroom of the device")
Signed-off-by: John Fastabend <john.r.fastabend@intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2016-01-31 11:29:00 -08:00
Francesco Ruggeri
07a5d38453 net: possible use after free in dst_release
dst_release should not access dst->flags after decrementing
__refcnt to 0. The dst_entry may be in dst_busy_list and
dst_gc_task may dst_destroy it before dst_release gets a chance
to access dst->flags.

Fixes: d69bbf88c8 ("net: fix a race in dst_release()")
Fixes: 27b75c95f1 ("net: avoid RCU for NOCACHE dst")
Signed-off-by: Francesco Ruggeri <fruggeri@arista.com>
Acked-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-01-06 15:00:27 -05:00
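
The rule in commit 07a5d38453 above (do not touch the object after the reference count may have reached zero) can be sketched in userspace with C11 atomics. The struct, the flag, and the function are hypothetical stand-ins, not the kernel's dst_entry code.

```c
#include <stdatomic.h>
#include <stdbool.h>

#define OBJ_NOCACHE 0x1u        /* hypothetical flag bit */

/* Hypothetical reference-counted object. */
struct obj {
    atomic_int refcnt;
    unsigned int flags;
};

/* Drop one reference. Returns true when the caller released the last
 * reference to an OBJ_NOCACHE object and should schedule its freeing. */
static bool obj_release(struct obj *o)
{
    /* Read the flags while we still hold our reference. */
    bool nocache = (o->flags & OBJ_NOCACHE) != 0;

    /* atomic_fetch_sub returns the old value, so subtract 1 again. */
    int newref = atomic_fetch_sub(&o->refcnt, 1) - 1;

    /* If newref is 0, another thread may already be destroying *o,
     * so only the local copies are used from here on. */
    return newref == 0 && nocache;
}
```

The design point is the ordering: everything needed from the object is copied into locals before the decrement, so no field is read once the reference is gone.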
Linus Torvalds
73796d8bf2 Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net
Pull networking fixes from David Miller:

 1) Fix uninitialized variable warnings in nfnetlink_queue, a lot of
    people reported this...  From Arnd Bergmann.

 2) Don't init mutex twice in i40e driver, from Jesse Brandeburg.

 3) Fix spurious EBUSY in rhashtable, from Herbert Xu.

 4) Missing DMA unmaps in mvpp2 driver, from Marcin Wojtas.

 5) Fix race with work structure access in pppoe driver causing
    corruptions, from Guillaume Nault.

 6) Fix OOPS due to sh_eth_rx() not checking whether netdev_alloc_skb()
    actually succeeded or not, from Sergei Shtylyov.

 7) Don't lose flags when setting IFA_F_OPTIMISTIC in ipv6 code, from
    Bjørn Mork.

 8) VXLAN_HD_RCO defined incorrectly, fix from Jiri Benc.

 9) Fix clock source used for cookies in SCTP, from Marcelo Ricardo
    Leitner.

10) aurora driver needs HAS_DMA dependency, from Geert Uytterhoeven.

11) ndo_fill_metadata_dst op of vxlan has to handle ipv6 tunneling
    properly as well, from Jiri Benc.

12) Handle request sockets properly in xfrm layer, from Eric Dumazet.

13) Double stats update in ipv6 geneve transmit path, fix from Pravin B
    Shelar.

14) sk->sk_policy[] needs RCU protection, and as a result
    xfrm_policy_destroy() needs to free policies using an RCU grace
    period, from Eric Dumazet.

15) SCTP needs to clone ipv6 tx options in order to avoid use after
    free, from Eric Dumazet.

16) Missing kbuild export of ila.h, from Stephen Hemminger.

17) Missing mdiobus_alloc() return value checking in mdio-mux.c, from
    Tobias Klauser.

18) Validate protocol value range in ->create() methods, from Hannes
    Frederic Sowa.

19) Fix early socket demux races that result in illegal dst reuse, from
    Eric Dumazet.

20) Validate socket address length in pptp code, from WANG Cong.

21) skb_reorder_vlan_header() uses incorrect offset and can corrupt
    packets, from Vlad Yasevich.

22) Fix memory leaks in nl80211 registry code, from Ola Olsson.

23) Timeout loop count handling fixes in mISDN, xgbe, qlge, sfc, and
    qlcnic.  From Dan Carpenter.

24) msg.msg_iocb needs to be cleared in recvfrom() otherwise, for
    example, AF_ALG will interpret it as an async call.  From Tadeusz
    Struk.

25) inetpeer_set_addr_v4 forgets to initialize the 'vif' field, from
    Eric Dumazet.

26) rhashtable enforces the minimum table size not early enough,
    breaking how we calculate the per-cpu lock allocations.  From
    Herbert Xu.

27) Fix FCC port lockup in 82xx driver, from Martin Roth.

28) FOU sockets need to be freed using RCU, from Hannes Frederic Sowa.

29) Fix out-of-bounds access in __skb_complete_tx_timestamp() and
    sock_setsockopt() wrt.  timestamp handling.  From WANG Cong.

* git://git.kernel.org/pub/scm/linux/kernel/git/davem/net: (117 commits)
  net: check both type and procotol for tcp sockets
  drivers: net: xgene: fix Tx flow control
  tcp: restore fastopen with no data in SYN packet
  af_unix: Revert 'lock_interruptible' in stream receive code
  fou: clean up socket with kfree_rcu
  82xx: FCC: Fixing a bug causing to FCC port lock-up
  gianfar: Don't enable RX Filer if not supported
  net: fix warnings in 'make htmldocs' by moving macro definition out of field declaration
  rhashtable: Fix walker list corruption
  rhashtable: Enforce minimum size on initial hash table
  inet: tcp: fix inetpeer_set_addr_v4()
  ipv6: automatically enable stable privacy mode if stable_secret set
  net: fix uninitialized variable issue
  bluetooth: Validate socket address length in sco_sock_bind().
  net_sched: make qdisc_tree_decrease_qlen() work for non mq
  ser_gigaset: remove unnecessary kfree() calls from release method
  ser_gigaset: fix deallocation of platform device structure
  ser_gigaset: turn nonsense checks into WARN_ON
  ser_gigaset: fix up NULL checks
  qlcnic: fix a timeout loop
  ...
2015-12-17 14:05:22 -08:00
WANG Cong
ac5cc97799 net: check both type and procotol for tcp sockets
Dmitry reported the following out-of-bound access:

Call Trace:
 [<ffffffff816cec2e>] __asan_report_load4_noabort+0x3e/0x40
mm/kasan/report.c:294
 [<ffffffff84affb14>] sock_setsockopt+0x1284/0x13d0 net/core/sock.c:880
 [<     inline     >] SYSC_setsockopt net/socket.c:1746
 [<ffffffff84aed7ee>] SyS_setsockopt+0x1fe/0x240 net/socket.c:1729
 [<ffffffff85c18c76>] entry_SYSCALL_64_fastpath+0x16/0x7a
arch/x86/entry/entry_64.S:185

This is because we mistake a raw socket as a tcp socket.
We should check both sk->sk_type and sk->sk_protocol to ensure
it is a tcp socket.

Willem points out __skb_complete_tx_timestamp() needs to be fixed as well.
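
A minimal user-space sketch of the check described above. It assumes a stand-in `struct mini_sock` rather than the kernel's `struct sock`, and the helper name `is_tcp_sock()` is hypothetical. The point it illustrates is that a raw socket can carry IPPROTO_TCP as its protocol, so the protocol alone does not identify a tcp socket:

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <netinet/in.h>   /* IPPROTO_TCP */
#include <sys/socket.h>   /* SOCK_STREAM, SOCK_RAW */

/* Hypothetical stand-in for the two struct sock fields the commit names. */
struct mini_sock {
    int sk_type;      /* SOCK_STREAM, SOCK_RAW, ... */
    int sk_protocol;  /* IPPROTO_TCP, ... */
};

/* True only when both the type and the protocol say "tcp". */
static bool is_tcp_sock(const struct mini_sock *sk)
{
    assert(sk != NULL);
    return sk->sk_type == SOCK_STREAM && sk->sk_protocol == IPPROTO_TCP;
}
```

With this helper, a `{ SOCK_RAW, IPPROTO_TCP }` socket is rejected while a `{ SOCK_STREAM, IPPROTO_TCP }` socket is accepted.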

Reported-by: Dmitry Vyukov <dvyukov@google.com>
Cc: Willem de Bruijn <willemdebruijn.kernel@gmail.com>
Cc: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com>
Acked-by: Willem de Bruijn <willemb@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-12-17 15:46:32 -05:00
Vlad Yasevich
f654861569 skbuff: Fix offset error in skb_reorder_vlan_header
skb_reorder_vlan_header is called after the vlan header has
been pulled.  As a result the offset of the beginning of
the mac header has been increased by 4 bytes (VLAN_HLEN).
When moving the mac addresses, include this increase in
the offset calculation so that the mac addresses are
copied correctly.
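
The offset arithmetic can be sketched in user space on a plain byte buffer. This is an illustration only: `reorder_vlan_header()` is a hypothetical helper, `data` plays the role of skb->data after the pull, and `mac_len` is assumed to be the mac header length recorded before the pull.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define ETH_ALEN  6   /* bytes in one mac address */
#define ETH_HLEN  14  /* untagged ethernet header */
#define VLAN_HLEN 4   /* one vlan tag */

/*
 * `data` points just past the vlan header that was already pulled.
 * Returns the new start of the mac header after the addresses moved.
 */
static unsigned char *reorder_vlan_header(unsigned char *data, size_t mac_len)
{
    /* The pull advanced `data` by VLAN_HLEN, so the old mac header
     * starts mac_len + VLAN_HLEN bytes back, not mac_len bytes back. */
    unsigned char *old_mac = data - mac_len - VLAN_HLEN;
    unsigned char *new_mac = data - ETH_HLEN;

    assert(mac_len >= ETH_HLEN);
    memmove(new_mac, old_mac, 2 * ETH_ALEN);  /* dst + src addresses */
    return new_mac;
}
```

Dropping the `- VLAN_HLEN` term would copy 12 bytes starting 4 bytes into the old header, which is the kind of packet corruption the commit describes.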

Fixes: a6e18ff111 (vlan: Fix untag operations of stacked vlans with REORDER_HEADER off)
CC: Nicolas Dichtel <nicolas.dichtel@6wind.com>
CC: Patrick McHardy <kaber@trash.net>
Signed-off-by: Vladislav Yasevich <vyasevich@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-12-15 00:30:41 -05:00
Eric Dumazet
d188ba86dd xfrm: add rcu protection to sk->sk_policy[]
XFRM can deal with SYNACK messages, sent while listener socket
is not locked. We add proper rcu protection to __xfrm_sk_clone_policy()
and xfrm_sk_policy_lookup()

This might serve as the first step to remove xfrm.xfrm_policy_lock
use in fast path.

Fixes: fa76ce7328 ("inet: get rid of central tcp/dccp listener timer")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Acked-by: Steffen Klassert <steffen.klassert@secunet.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-12-11 19:22:06 -05:00
Tejun Heo
0b98f0c042 Merge branch 'master' into for-4.4-fixes
The following commit which went into mainline through networking tree

  3b13758f51 ("cgroups: Allow dynamically changing net_classid")

conflicts in net/core/netclassid_cgroup.c with the following pending
fix in cgroup/for-4.4-fixes.

  1f7dd3e5a6 ("cgroup: fix handling of multi-destination migration from subtree_control enabling")

The former separates out update_classid() from cgrp_attach() and
updates it to walk all fds of all tasks in the target css so that it
can be used from both migration and config change paths.  The latter
drops @css from cgrp_attach().

Resolve the conflict by making cgrp_attach() call update_classid()
with the css from the first task.  We can revive @tset walking in
cgrp_attach() but given that net_cls is v1 only where there always is
only one target css during migration, this is fine.

Signed-off-by: Tejun Heo <tj@kernel.org>
Reported-by: Stephen Rothwell <sfr@canb.auug.org.au>
Cc: Nina Schiff <ninasc@fb.com>
2015-12-07 10:09:03 -05:00
Marcelo Ricardo Leitner
01ce63c901 sctp: update the netstamp_needed counter when copying sockets
Dmitry Vyukov reported that SCTP was triggering a WARN on socket destroy
related to disabling sock timestamp.

When SCTP accepts an association or peels one off, it copies sock flags
but forgot to call net_enable_timestamp() if a packet timestamping flag
was copied, leading to extra calls to net_disable_timestamp() whenever
such clones were closed.

The fix is to call net_enable_timestamp() whenever we copy a sock with
that flag on, like tcp does.
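
A rough sketch of the accounting problem, under the assumption that each socket with a timestamp flag set holds exactly one reference on a global counter. `struct mini_sock` and all helper names here are hypothetical stand-ins, not the kernel API:

```c
#include <assert.h>
#include <stdbool.h>

static int netstamp_needed;            /* stand-in for the global counter */

struct mini_sock { bool timestamp; };  /* hypothetical: one timestamp flag */

/* Setting the flag on a socket takes one reference. */
static void sock_enable_ts(struct mini_sock *sk)
{
    if (!sk->timestamp) {
        sk->timestamp = true;
        netstamp_needed++;
    }
}

/* Copying the flag to a clone must take a reference for the clone too. */
static void sock_copy_flags(struct mini_sock *newsk, const struct mini_sock *oldsk)
{
    newsk->timestamp = oldsk->timestamp;
    if (newsk->timestamp)
        netstamp_needed++;
}

/* Closing a socket drops its reference; the counter must never go negative. */
static void sock_close(struct mini_sock *sk)
{
    if (sk->timestamp) {
        sk->timestamp = false;
        netstamp_needed--;
    }
    assert(netstamp_needed >= 0);
}
```

Without the increment in `sock_copy_flags()`, closing both the original and the clone would decrement the counter twice for a single increment.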

Reported-by: Dmitry Vyukov <dvyukov@google.com>
Signed-off-by: Marcelo Ricardo Leitner <marcelo.leitner@gmail.com>
Acked-by: Vlad Yasevich <vyasevich@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-12-05 22:23:22 -05:00
Eric Dumazet
6bd4f355df ipv6: kill sk_dst_lock
While testing the np->opt RCU conversion, I found that UDP/IPv6 was
using a mixture of xchg() and sk_dst_lock to protect concurrent changes
to sk->sk_dst_cache, leading to possible corruptions and crashes.

ip6_sk_dst_lookup_flow() uses sk_dst_check() anyway, so the simplest
way to fix the mess is to remove sk_dst_lock completely, as we did for
IPv4.

__ip6_dst_store() and ip6_dst_store() share the same implementation.

Since sk_setup_caps() can be called with or without the socket lock
held, we have to use sk_dst_set() instead of __sk_dst_set().

Note that I had to move the "np->dst_cookie = rt6_get_cookie(rt);"
in ip6_dst_store() before the sk_setup_caps(sk, dst) call.

This is because ip6_dst_store() can be called from process context,
without any lock held.

As soon as the dst is installed in sk->sk_dst_cache, dst can be freed
from another cpu doing a concurrent ip6_dst_store()

Doing the dst dereference before doing the install is needed to make
sure no use after free would trigger.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Reported-by: Dmitry Vyukov <dvyukov@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-12-03 11:32:06 -05:00
Tejun Heo
1f7dd3e5a6 cgroup: fix handling of multi-destination migration from subtree_control enabling
Consider the following v2 hierarchy.

  P0 (+memory) --- P1 (-memory) --- A
                                 \- B
       
P0 has memory enabled in its subtree_control while P1 doesn't.  If
both A and B contain processes, they would belong to the memory css of
P1.  Now if memory is enabled on P1's subtree_control, memory csses
should be created on both A and B and A's processes should be moved to
the former and B's processes the latter.  IOW, enabling controllers
can cause atomic migrations into different csses.

The core cgroup migration logic has been updated accordingly but the
controller migration methods haven't and still assume that all tasks
migrate to a single target css; furthermore, the methods were fed the
css in which subtree_control was updated which is the parent of the
target csses.  pids controller depends on the migration methods to
move charges and this made the controller attribute charges to the
wrong csses often triggering the following warning by driving a
counter negative.

 WARNING: CPU: 1 PID: 1 at kernel/cgroup_pids.c:97 pids_cancel.constprop.6+0x31/0x40()
 Modules linked in:
 CPU: 1 PID: 1 Comm: systemd Not tainted 4.4.0-rc1+ #29
 ...
  ffffffff81f65382 ffff88007c043b90 ffffffff81551ffc 0000000000000000
  ffff88007c043bc8 ffffffff810de202 ffff88007a752000 ffff88007a29ab00
  ffff88007c043c80 ffff88007a1d8400 0000000000000001 ffff88007c043bd8
 Call Trace:
  [<ffffffff81551ffc>] dump_stack+0x4e/0x82
  [<ffffffff810de202>] warn_slowpath_common+0x82/0xc0
  [<ffffffff810de2fa>] warn_slowpath_null+0x1a/0x20
  [<ffffffff8118e031>] pids_cancel.constprop.6+0x31/0x40
  [<ffffffff8118e0fd>] pids_can_attach+0x6d/0xf0
  [<ffffffff81188a4c>] cgroup_taskset_migrate+0x6c/0x330
  [<ffffffff81188e05>] cgroup_migrate+0xf5/0x190
  [<ffffffff81189016>] cgroup_attach_task+0x176/0x200
  [<ffffffff8118949d>] __cgroup_procs_write+0x2ad/0x460
  [<ffffffff81189684>] cgroup_procs_write+0x14/0x20
  [<ffffffff811854e5>] cgroup_file_write+0x35/0x1c0
  [<ffffffff812e26f1>] kernfs_fop_write+0x141/0x190
  [<ffffffff81265f88>] __vfs_write+0x28/0xe0
  [<ffffffff812666fc>] vfs_write+0xac/0x1a0
  [<ffffffff81267019>] SyS_write+0x49/0xb0
  [<ffffffff81bcef32>] entry_SYSCALL_64_fastpath+0x12/0x76

This patch fixes the bug by removing the @css parameter from the three
migration methods, ->can_attach(), ->cancel_attach() and ->attach(), and
updating the cgroup_taskset iteration helpers to also return the
destination css in addition to the task being migrated.  All controllers
are updated accordingly.

* Controllers which don't care whether there are one or multiple
  target csses can be converted trivially.  cpu, io, freezer, perf,
  netclassid and netprio fall in this category.

* cpuset's current implementation assumes that there's a single source
  and destination and thus doesn't support v2 hierarchy already.  The
  only change made by this patchset is how that single destination css
  is obtained.

* memory migration path already doesn't do anything on v2.  How the
  single destination css is obtained is updated and the prep stage of
  mem_cgroup_can_attach() is reordered to accommodate the change.

* pids is the only controller which was affected by this bug.  It now
  correctly handles multi-destination migrations and no longer causes
  counter underflow from incorrect accounting.

Signed-off-by: Tejun Heo <tj@kernel.org>
Reported-and-tested-by: Daniel Wagner <daniel.wagner@bmw-carit.de>
Cc: Aleksa Sarai <cyphar@cyphar.com>
2015-12-03 10:18:21 -05:00
Konstantin Khlebnikov
6adc5fd6a1 net/neighbour: fix crash at dumping device-agnostic proxy entries
Proxy entries could have null pointer to net-device.

Signed-off-by: Konstantin Khlebnikov <koct9i@gmail.com>
Fixes: 84920c1420 ("net: Allow ipv6 proxies and arp proxies be shown with iproute2")
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-12-03 00:07:51 -05:00
Eric Dumazet
ceb5d58b21 net: fix sock_wake_async() rcu protection
Dmitry provided a syzkaller (http://github.com/google/syzkaller)
triggering a fault in sock_wake_async() when async IO is requested.

Said program stressed af_unix sockets, but the issue is generic
and should be addressed in core networking stack.

The problem is that by the time sock_wake_async() is called,
we should not access the @flags field of 'struct socket',
as the inode containing this socket might be freed without
further notice, and without RCU grace period.

We already maintain an RCU protected structure, "struct socket_wq"
so moving SOCKWQ_ASYNC_NOSPACE & SOCKWQ_ASYNC_WAITDATA into it
is the safe route.

It also reduces number of cache lines needing dirtying, so might
provide a performance improvement anyway.

In followup patches, we might move remaining flags (SOCK_NOSPACE,
SOCK_PASSCRED, SOCK_PASSSEC) to save 8 bytes and let 'struct socket'
be mostly read and shared between cpus.

Reported-by: Dmitry Vyukov <dvyukov@google.com>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-12-01 15:45:05 -05:00
Eric Dumazet
9cd3e072b0 net: rename SOCK_ASYNC_NOSPACE and SOCK_ASYNC_WAITDATA
This patch is a cleanup to make following patch easier to
review.

Goal is to move SOCK_ASYNC_NOSPACE and SOCK_ASYNC_WAITDATA
from (struct socket)->flags to a (struct socket_wq)->flags
to benefit from RCU protection in sock_wake_async()

To ease backports, we rename both constants.

Two new helpers, sk_set_bit(int nr, struct sock *sk)
and sk_clear_bit(int nr, struct sock *sk) are added so that
following patch can change their implementation.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-12-01 15:45:05 -05:00
Nina Schiff
3b13758f51 cgroups: Allow dynamically changing net_classid
The classid of a process is changed either when a process is moved to
or from a cgroup or when the net_cls.classid file is updated.
Previously net_cls only supported propagating these changes to the
cgroup's related sockets when a process was added or removed from the
cgroup. This means it was necessary to remove and re-add all processes
to a cgroup in order to update its classid. This change introduces
support for doing this dynamically - i.e. when the value is changed in
the net_cls.classid file, this will also trigger an update to the
classid associated with all sockets controlled by the cgroup.
This mimics the behaviour of other cgroup subsystems.
net_prio circumvents this issue by storing an index into a table with
each socket (and so any updates to the table don't require updating
the value associated with the socket). net_cls, however, passes the
socket the classid directly, and so this additional step is needed.
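
The difference can be sketched as follows, assuming simplified stand-in types; `struct mini_cgroup`, `struct mini_sock` and `cgroup_write_classid()` are hypothetical names for illustration. Because each socket stores its own copy of the classid, a write to the cgroup's value has to walk the sockets:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

struct mini_sock { uint32_t classid; };   /* per-socket copy of the classid */

struct mini_cgroup {
    uint32_t classid;
    struct mini_sock **socks;  /* sockets owned by tasks in the cgroup */
    size_t nsocks;
};

/* Store the new classid and push it to every socket in the cgroup.
 * Returns the number of sockets updated. */
static size_t cgroup_write_classid(struct mini_cgroup *cg, uint32_t classid)
{
    assert(cg != NULL);
    cg->classid = classid;
    for (size_t i = 0; i < cg->nsocks; i++)
        cg->socks[i]->classid = classid;
    return cg->nsocks;
}
```

A table-index scheme such as the one described for net_prio would skip the loop, since sockets would only hold an index that stays valid when the table entry changes.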

Signed-off-by: Nina Schiff <ninasc@fb.com>
Acked-by: Tejun Heo <tj@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-11-23 12:13:46 -05:00
Daniel Borkmann
6900317f5e net, scm: fix PaX detected msg_controllen overflow in scm_detach_fds
David and HacKurx reported the following (or a similar) size overflow triggered
in a grsecurity kernel, thanks to PaX's gcc size overflow plugin:

(Already fixed in later grsecurity versions by Brad and PaX Team.)

[ 1002.296137] PAX: size overflow detected in function scm_detach_fds net/core/scm.c:314
               cicus.202_127 min, count: 4, decl: msg_controllen; num: 0; context: msghdr;
[ 1002.296145] CPU: 0 PID: 3685 Comm: scm_rights_recv Not tainted 4.2.3-grsec+ #7
[ 1002.296149] Hardware name: Apple Inc. MacBookAir5,1/Mac-66F35F19FE2A0D05, [...]
[ 1002.296153]  ffffffff81c27366 0000000000000000 ffffffff81c27375 ffffc90007843aa8
[ 1002.296162]  ffffffff818129ba 0000000000000000 ffffffff81c27366 ffffc90007843ad8
[ 1002.296169]  ffffffff8121f838 fffffffffffffffc fffffffffffffffc ffffc90007843e60
[ 1002.296176] Call Trace:
[ 1002.296190]  [<ffffffff818129ba>] dump_stack+0x45/0x57
[ 1002.296200]  [<ffffffff8121f838>] report_size_overflow+0x38/0x60
[ 1002.296209]  [<ffffffff816a979e>] scm_detach_fds+0x2ce/0x300
[ 1002.296220]  [<ffffffff81791899>] unix_stream_read_generic+0x609/0x930
[ 1002.296228]  [<ffffffff81791c9f>] unix_stream_recvmsg+0x4f/0x60
[ 1002.296236]  [<ffffffff8178dc00>] ? unix_set_peek_off+0x50/0x50
[ 1002.296243]  [<ffffffff8168fac7>] sock_recvmsg+0x47/0x60
[ 1002.296248]  [<ffffffff81691522>] ___sys_recvmsg+0xe2/0x1e0
[ 1002.296257]  [<ffffffff81693496>] __sys_recvmsg+0x46/0x80
[ 1002.296263]  [<ffffffff816934fc>] SyS_recvmsg+0x2c/0x40
[ 1002.296271]  [<ffffffff8181a3ab>] entry_SYSCALL_64_fastpath+0x12/0x85

Further investigation showed that this can happen when an *odd* number of
fds are being passed over AF_UNIX sockets.

In these cases CMSG_LEN(i * sizeof(int)) and CMSG_SPACE(i * sizeof(int)),
where i is the number of successfully passed fds, differ by 4 bytes due
to the extra CMSG_ALIGN() padding in CMSG_SPACE() to an 8 byte boundary
on 64 bit. The padding is used to align subsequent cmsg headers in the
control buffer.

When the control buffer passed in from the receiver side *lacks* these 4
bytes (e.g. due to buggy/wrong API usage), then msg->msg_controllen will
overflow in scm_detach_fds():

  int cmlen = CMSG_LEN(i * sizeof(int));  <--- cmlen w/o tail-padding
  err = put_user(SOL_SOCKET, &cm->cmsg_level);
  if (!err)
    err = put_user(SCM_RIGHTS, &cm->cmsg_type);
  if (!err)
    err = put_user(cmlen, &cm->cmsg_len);
  if (!err) {
    cmlen = CMSG_SPACE(i * sizeof(int));  <--- cmlen w/ 4 byte extra tail-padding
    msg->msg_control += cmlen;
    msg->msg_controllen -= cmlen;         <--- iff no tail-padding space here ...
  }                                            ... wrap-around

F.e. it will wrap to a length of 18446744073709551612 bytes in case the
receiver passed in msg->msg_controllen of 20 bytes, and the sender
properly transferred 1 fd to the receiver, so that its CMSG_LEN results
in 20 bytes and CMSG_SPACE in 24 bytes.
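
The arithmetic above can be checked in user space. This sketch hard-codes the 64 bit layout it assumes (a 16 byte cmsghdr, 8 byte CMSG_ALIGN, 4 byte int) instead of using the libc macros, so the numbers match the example on any host. The function names are hypothetical, and the clamped variant is only one plausible way to avoid the wrap-around:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define HDR_LEN   UINT64_C(16)                          /* assumed 64 bit cmsghdr size */
#define ALIGN8(n) (((n) + UINT64_C(7)) & ~UINT64_C(7))  /* CMSG_ALIGN on 64 bit */

/* CMSG_LEN: header plus payload, no tail padding. */
static uint64_t cmsg_len(uint64_t payload)   { return HDR_LEN + payload; }

/* CMSG_SPACE: header plus payload rounded up to the alignment. */
static uint64_t cmsg_space(uint64_t payload) { return HDR_LEN + ALIGN8(payload); }

/* Unchecked update as in the snippet above: wraps if space > controllen. */
static uint64_t controllen_unchecked(uint64_t controllen, uint64_t nfds)
{
    return controllen - cmsg_space(nfds * 4);
}

/* Clamped update: never subtract more than what is left in the buffer. */
static uint64_t controllen_clamped(uint64_t controllen, uint64_t nfds)
{
    uint64_t space = cmsg_space(nfds * 4);

    if (space > controllen)
        space = controllen;
    return controllen - space;
}
```

For one fd and a 20 byte control buffer, the unchecked form yields 18446744073709551612 while the clamped form yields 0. For an even number of fds, cmsg_len() and cmsg_space() agree and neither form wraps.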

In case of MSG_CMSG_COMPAT (scm_detach_fds_compat()), I haven't seen an
issue in my tests as alignment seems always on 4 byte boundary. Same
should be in case of native 32 bit, where we end up with 4 byte boundaries
as well.

In practice, passing msg->msg_controllen of 20 to recvmsg() while receiving
a single fd would mean that on successful return, msg->msg_controllen is
being set by the kernel to 24 bytes instead, thus more than the input
buffer advertised. It could f.e. become an issue if such application later
on zeroes or copies the control buffer based on the returned msg->msg_controllen
elsewhere.

Maximum number of fds we can send is a hard upper limit SCM_MAX_FD (253).

Going over the code, it seems like msg->msg_controllen is not being read
after scm_detach_fds() in scm_recv() anymore by the kernel, good!

Relevant recvmsg() handlers are unix_dgram_recvmsg() (unix_seqpacket_recvmsg())
and unix_stream_recvmsg(). Both return back to their recvmsg() caller,
and ___sys_recvmsg() places the updated length, that is, new msg_control -
old msg_control pointer into msg->msg_controllen (hence the 24 bytes seen
in the example).

Long time ago, Wei Yongjun fixed something related in commit 1ac70e7ad2
("[NET]: Fix function put_cmsg() which may cause usr application memory
overflow").

RFC3542, section 20.2. says:

  The fields shown as "XX" are possible padding, between the cmsghdr
  structure and the data, and between the data and the next cmsghdr
  structure, if required by the implementation. While sending an
  application may or may not include padding at the end of last
  ancillary data in msg_controllen and implementations must accept both
  as valid. On receiving a portable application must provide space for
  padding at the end of the last ancillary data as implementations may
  copy out the padding at the end of the control message buffer and
  include it in the received msg_controllen. When recvmsg() is called
  if msg_controllen is too small for all the ancillary data items
  including any trailing padding after the last item an implementation
  may set MSG_CTRUNC.

Since we haven't set MSG_CTRUNC here for quite a long time already, just do
the same as in 1ac70e7ad2 to avoid an overflow.

Btw, even man-page author got this wrong :/ See db939c9b26e9 ("cmsg.3: Fix
error in SCM_RIGHTS code sample"). Some people must have copied this (?),
thus it got triggered in the wild (reported several times during boot by
David and HacKurx).

No Fixes tag this time as pre 2002 (that is, pre history tree).

Reported-by: David Sterba <dave@jikos.cz>
Reported-by: HacKurx <hackurx@gmail.com>
Cc: PaX Team <pageexec@freemail.hu>
Cc: Emese Revfy <re.emese@gmail.com>
Cc: Brad Spengler <spender@grsecurity.net>
Cc: Wei Yongjun <yongjun_wei@trendmicro.com.cn>
Cc: Eric Dumazet <edumazet@google.com>
Reviewed-by: Hannes Frederic Sowa <hannes@stressinduktion.org>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-11-22 20:34:58 -05:00
Nikolay Aleksandrov
17b85d29e8 net/core: revert "net: fix __netdev_update_features return.." and add comment
This reverts commit 00ee592717 ("net: fix __netdev_update_features return
on ndo_set_features failure")
and adds a comment explaining why it's okay to return a value other than
0 upon error. Some drivers might actually change flags and return an
error so it's better to fire a spurious notification rather than miss
these.

CC: Michał Mirosław <mirq-linux@rere.qmqm.pl>
Signed-off-by: Nikolay Aleksandrov <nikolay@cumulusnetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-11-17 15:25:45 -05:00
Hannes Frederic Sowa
b22b941b2c rtnetlink: fix frame size warning in rtnl_fill_ifinfo
Fix the following warning:

  CC      net/core/rtnetlink.o
net/core/rtnetlink.c: In function ‘rtnl_fill_ifinfo’:
net/core/rtnetlink.c:1308:1: warning: the frame size of 2864 bytes is larger than 2048 bytes [-Wframe-larger-than=]
 }
 ^
by splitting up the huge rtnl_fill_ifinfo into some smaller ones, so we
don't have the huge frame allocations at the same time.

Cc: Eric Dumazet <edumazet@google.com>
Signed-off-by: Hannes Frederic Sowa <hannes@stressinduktion.org>
Acked-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-11-17 15:25:44 -05:00
Martin Zhang
19125c1a4f net: use skb_clone to avoid alloc_pages failure.
1. new skb only needs dst and ip address (v4 or v6).
2. skb_copy may need high order pages, which are very rare on a long running server.

Signed-off-by: Junwei Zhang <linggao.zjw@alibaba-inc.com>
Signed-off-by: Martin Zhang <martinbj2008@gmail.com>
Acked-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-11-17 15:25:44 -05:00
Vlad Yasevich
a6e18ff111 vlan: Fix untag operations of stacked vlans with REORDER_HEADER off
When we have multiple stacked vlan devices all of which have
turned off REORDER_HEADER flag, the untag operation does not
locate the ethernet addresses correctly for nested vlans.
The reason is that in case of REORDER_HEADER flag being off,
the outer vlan headers are put back and the mac_len is adjusted
to account for the presence of the header.  Then, the subsequent
untag operation, for the next level vlan, always uses VLAN_ETH_HLEN
to locate the beginning of the ethernet header and that ends up
being a multiple of 4 bytes short of the actual beginning
of the mac header (the multiple depending on how many vlan
encapsulations there are).

As a result, if there are multiple levels of vlan devices
with REORDER_HEADER being off, the received packets end up
being dropped.

To solve this, we use skb->mac_len as the offset.  The value
is always set on receive path and starts out as ETH_HLEN.
The value is also updated when the vlan header manipulations occur
so we know it will be correct.

Signed-off-by: Vladislav Yasevich <vyasevic@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-11-17 14:38:35 -05:00
Bjørn Mork
88ad4175b2 net/core: use netdev name in warning if no parent
A recent flaw in the netdev feature setting resulted in warnings
like this one from VLAN interfaces:

 WARNING: CPU: 1 PID: 4975 at net/core/dev.c:2419 skb_warn_bad_offload+0xbc/0xcb()
 : caps=(0x00000000001b5820, 0x00000000001b5829) len=2782 data_len=0 gso_size=1348 gso_type=16 ip_summed=3

The ":" is supposed to be preceded by a driver name, but in this
case it is an empty string since the device has no parent.

There are many types of network devices without a parent. The
anonymous warnings for these devices can be hard to debug.  Log
the network device name instead in these cases to assist further
debugging.

This is mostly similar to how __netdev_printk() handles orphan
devices.

Signed-off-by: Bjørn Mork <bjorn@mork.no>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-11-16 16:21:48 -05:00
Nikolay Aleksandrov
00ee592717 net: fix __netdev_update_features return on ndo_set_features failure
If ndo_set_features fails __netdev_update_features() will return -1 but
this is wrong because it is expected to return 0 if no features were
changed (see netdev_update_features()), which will cause a netdev
notifier to be called without any actual changes. Fix this by returning
0 if ndo_set_features fails.

Fixes: 6cb6a27c45 ("net: Call netdev_features_change() from netdev_update_features()")
CC: Michał Mirosław <mirq-linux@rere.qmqm.pl>
Signed-off-by: Nikolay Aleksandrov <nikolay@cumulusnetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-11-16 14:56:03 -05:00
Nikolay Aleksandrov
5f8dc33e8e net: fix feature changes on devices without ndo_set_features
When __netdev_update_features() was updated to ensure some features are
disabled on new lower devices, an error was introduced for devices which
don't have the ndo_set_features() method set. Previously we would just
set the new features, but now we return an error and don't set them. Fix
this by restoring the old behaviour and setting err to 0 when
ndo_set_features is not present.

Fixes: e7868a85e1 ("net/core: ensure features get disabled on new lower devs")
CC: Jarod Wilson <jarod@redhat.com>
CC: Jiri Pirko <jiri@resnulli.us>
CC: Ido Schimmel <idosch@mellanox.com>
CC: Sander Eikelenboom <linux@eikelenboom.it>
CC: Andy Gospodarek <gospo@cumulusnetworks.com>
CC: Florian Fainelli <f.fainelli@gmail.com>
Signed-off-by: Nikolay Aleksandrov <nikolay@cumulusnetworks.com>
Reviewed-by: Jiri Pirko <jiri@mellanox.com>
Reviewed-by: Andy Gospodarek <gospo@cumulusnetworks.com>
Reviewed-by: Jarod Wilson <jarod@redhat.com>
Tested-by: Florian Fainelli <f.fainelli@gmail.com>
Tested-by: Dave Young <dyoung@redhat.com>
Tested-by: Christian Borntraeger <borntraeger@de.ibm.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-11-16 14:56:03 -05:00
Linus Torvalds
2df4ee78d0 Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net
Pull networking fixes from David Miller:

 1) Fix null deref in xt_TEE netfilter module, from Eric Dumazet.

 2) Several spots need to get to the original listener for SYN-ACK
    packets, most spots got this right but some did not.  Whilst covering
    the remaining cases, create a helper to do this.  From Eric Dumazet.

 3) Missing check of return value from alloc_netdev() in CAIF SPI code,
    from Rasmus Villemoes.

 4) Don't sleep while != TASK_RUNNING in macvtap, from Vlad Yasevich.

 5) Use after free in mvneta driver, from Justin Maggard.

 6) Fix race on dst->flags access in dst_release(), from Eric Dumazet.

 7) Add missing ZLIB_INFLATE dependency for new qed driver.  From Arnd
    Bergmann.

 8) Fix multicast getsockopt deadlock, from WANG Cong.

 9) Fix deadlock in btusb, from Kuba Pawlak.

10) Some ipv6_add_dev() failure paths were not cleaning up the SNMP6
    counter state.  From Sabrina Dubroca.

11) Fix packet_bind() race, which can cause lost notifications, from
    Francesco Ruggeri.

12) Fix MAC restoration in qlcnic driver during bonding mode changes,
    from Jarod Wilson.

13) Revert bridging forward delay change which broke libvirt and other
    userspace things, from Vlad Yasevich.

* git://git.kernel.org/pub/scm/linux/kernel/git/davem/net: (65 commits)
  Revert "bridge: Allow forward delay to be cfgd when STP enabled"
  bpf_trace: Make dependent on PERF_EVENTS
  qed: select ZLIB_INFLATE
  net: fix a race in dst_release()
  net: mvneta: Fix memory use after free.
  net: Documentation: Fix default value tcp_limit_output_bytes
  macvtap: Resolve possible __might_sleep warning in macvtap_do_read()
  mvneta: add FIXED_PHY dependency
  net: caif: check return value of alloc_netdev
  net: hisilicon: NET_VENDOR_HISILICON should depend on HAS_DMA
  drivers: net: xgene: fix RGMII 10/100Mb mode
  netfilter: nft_meta: use skb_to_full_sk() helper
  net_sched: em_meta: use skb_to_full_sk() helper
  sched: cls_flow: use skb_to_full_sk() helper
  netfilter: xt_owner: use skb_to_full_sk() helper
  smack: use skb_to_full_sk() helper
  net: add skb_to_full_sk() helper and use it in selinux_netlbl_skbuff_setsid()
  bpf: doc: correct arch list for supported eBPF JIT
  dwc_eth_qos: Delete an unnecessary check before the function call "of_node_put"
  bonding: fix panic on non-ARPHRD_ETHER enslave failure
  ...
2015-11-10 18:11:41 -08:00
Eric Dumazet
d69bbf88c8 net: fix a race in dst_release()
Only cpu seeing dst refcount going to 0 can safely
dereference dst->flags.

Otherwise another cpu might already have freed the dst.
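
The rule can be sketched with C11 atomics. This is a user-space illustration, not the kernel code: `struct mini_dst`, `DEMO_DST_NOCACHE` (with an assumed bit value) and `mini_dst_release()` are hypothetical stand-ins.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

#define DEMO_DST_NOCACHE 0x1u   /* bit value assumed for illustration */

struct mini_dst {
    atomic_int refcnt;
    unsigned int flags;
};

/*
 * Drop one reference.  Returns true when the caller dropped the last
 * reference on a NOCACHE entry and is therefore responsible for freeing it.
 */
static bool mini_dst_release(struct mini_dst *dst)
{
    /* atomic_fetch_sub() returns the old value, so subtract 1 again. */
    int newref = atomic_fetch_sub(&dst->refcnt, 1) - 1;

    assert(newref >= 0);
    /* Short-circuit: dst->flags is read only once we know we hold the
     * last reference, so no other cpu can have freed dst under us. */
    return newref == 0 && (dst->flags & DEMO_DST_NOCACHE);
}
```

Testing the flag before the decrement result would read `dst->flags` on paths where another cpu may already own the final reference.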

Fixes: 27b75c95f1 ("net: avoid RCU for NOCACHE dst")
Reported-by: Greg Thelen <gthelen@google.com>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-11-09 21:55:48 -05:00
Mel Gorman
d0164adc89 mm, page_alloc: distinguish between being unable to sleep, unwilling to sleep and avoiding waking kswapd
__GFP_WAIT has been used to identify atomic context in callers that hold
spinlocks or are in interrupts.  They are expected to be high priority and
have access to one of two watermarks lower than "min" which can be referred
to as the "atomic reserve".  __GFP_HIGH users get access to the first
lower watermark and can be called the "high priority reserve".

Over time, callers had a requirement to not block when fallback options
were available.  Some have abused __GFP_WAIT leading to a situation where
an optimistic allocation with a fallback option can access atomic
reserves.

This patch uses __GFP_ATOMIC to identify callers that are truly atomic,
cannot sleep and have no alternative.  High priority users continue to use
__GFP_HIGH.  __GFP_DIRECT_RECLAIM identifies callers that can sleep and
are willing to enter direct reclaim.  __GFP_KSWAPD_RECLAIM identifies
callers that want to wake kswapd for background reclaim.  __GFP_WAIT is
redefined as a caller that is willing to enter direct reclaim and wake
kswapd for background reclaim.

This patch then converts a number of sites

o __GFP_ATOMIC is used by callers that are high priority and have memory
  pools for those requests. GFP_ATOMIC uses this flag.

o Callers that have a limited mempool to guarantee forward progress clear
  __GFP_DIRECT_RECLAIM but keep __GFP_KSWAPD_RECLAIM. bio allocations fall
  into this category where kswapd will still be woken but atomic reserves
  are not used as there is a one-entry mempool to guarantee progress.

o Callers that are checking if they are non-blocking should use the
  helper gfpflags_allow_blocking() where possible. This is because
  checking for __GFP_WAIT as was done historically now can trigger false
  positives. Some exceptions like dm-crypt.c exist where the code intent
  is clearer if __GFP_DIRECT_RECLAIM is used instead of the helper due to
  flag manipulations.

o Callers that built their own GFP flags instead of starting with GFP_KERNEL
  and friends now also need to specify __GFP_KSWAPD_RECLAIM.

The first key hazard to watch out for is callers that removed __GFP_WAIT
and were depending on access to atomic reserves for inconspicuous reasons.
In some cases it may be appropriate for them to use __GFP_HIGH.

The second key hazard is callers that assembled their own combination of
GFP flags instead of starting with something like GFP_KERNEL.  They may
now wish to specify __GFP_KSWAPD_RECLAIM.  It's almost certainly harmless
if it's missed in most cases as other activity will wake kswapd.
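
The flag split can be sketched as below. Everything here is a stand-in: the DEMO_* names mirror the __GFP_* flags discussed above, the bit values are placeholders, the composite masks are assumed from the description, and `demo_allow_blocking()` only illustrates what a helper like gfpflags_allow_blocking() is expected to test.

```c
#include <assert.h>
#include <stdbool.h>

typedef unsigned int demo_gfp_t;

/* Placeholder bit values; the real __GFP_* values differ. */
#define DEMO_GFP_HIGH            0x01u  /* may use the high priority reserve */
#define DEMO_GFP_ATOMIC          0x02u  /* cannot sleep, has no fallback */
#define DEMO_GFP_DIRECT_RECLAIM  0x04u  /* willing to enter direct reclaim */
#define DEMO_GFP_KSWAPD_RECLAIM  0x08u  /* wants kswapd woken */

/* Composite masks, assumed from the description above. */
#define DEMO_GFP_WAIT_MASK   (DEMO_GFP_DIRECT_RECLAIM | DEMO_GFP_KSWAPD_RECLAIM)
#define DEMO_GFP_ATOMIC_MASK (DEMO_GFP_HIGH | DEMO_GFP_ATOMIC | DEMO_GFP_KSWAPD_RECLAIM)

/* A caller may block only if it is willing to enter direct reclaim. */
static bool demo_allow_blocking(demo_gfp_t flags)
{
    return (flags & DEMO_GFP_DIRECT_RECLAIM) != 0;
}

/* Mempool-backed caller: give up direct reclaim, keep waking kswapd. */
static demo_gfp_t demo_mempool_flags(demo_gfp_t flags)
{
    assert((flags & DEMO_GFP_ATOMIC) == 0);
    return flags & ~DEMO_GFP_DIRECT_RECLAIM;
}
```

Note that the mempool-style mask is non-blocking without carrying the atomic bit, which is the distinction between "unwilling to sleep" and "unable to sleep" that the commit title draws.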

Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Acked-by: Michal Hocko <mhocko@suse.com>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Cc: Christoph Lameter <cl@linux.com>
Cc: David Rientjes <rientjes@google.com>
Cc: Vitaly Wool <vitalywool@gmail.com>
Cc: Rik van Riel <riel@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2015-11-06 17:50:42 -08:00
Jarod Wilson
e7868a85e1 net/core: ensure features get disabled on new lower devs
With moving netdev_sync_lower_features() after the .ndo_set_features
calls, I neglected to verify that devices added *after* a flag had been
disabled on an upper device were properly added with that flag disabled as
well. This currently does not happen, because we exit __netdev_update_features()
when we see dev->features == features for the upper dev. We can retain the
optimization of leaving without calling .ndo_set_features with a bit of
tweaking and a goto here.

Fixes: fd867d51f8 ("net/core: generic support for disabling netdev features down stack")
CC: "David S. Miller" <davem@davemloft.net>
CC: Eric Dumazet <edumazet@google.com>
CC: Jay Vosburgh <j.vosburgh@gmail.com>
CC: Veaceslav Falico <vfalico@gmail.com>
CC: Andy Gospodarek <gospo@cumulusnetworks.com>
CC: Jiri Pirko <jiri@resnulli.us>
CC: Nikolay Aleksandrov <razor@blackwall.org>
CC: Michal Kubecek <mkubecek@suse.cz>
CC: Alexander Duyck <alexander.duyck@gmail.com>
CC: netdev@vger.kernel.org
Reported-by: Nikolay Aleksandrov <nikolay@cumulusnetworks.com>
Signed-off-by: Jarod Wilson <jarod@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-11-04 21:56:00 -05:00
Jarod Wilson
5ba3f7d61a net/core: fix for_each_netdev_feature
As pointed out by Nikolay and further explained by Geert, the initial
for_each_netdev_feature macro was broken, as feature would get set outside
of the block of code it was intended to run in, thus only ever working for
the first feature bit in the mask. While less pretty this way, this is
tested and confirmed functional with multiple feature bits set in
NETIF_F_UPPER_DISABLES.
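
As a rough user-space illustration of what the macro is meant to do (visit
every set bit in a feature mask, not just the first), consider the sketch
below. The ex_ names are hypothetical and this is not the kernel macro; it
only shows a form whose body runs once per set bit.

```c
#include <stdint.h>

#define EX_FEATURE_COUNT 64

/* Run the following statement once for each set bit in mask, lowest bit
 * first. The for/if shape means a trailing "else" would bind to the
 * hidden "if", so always use braces around the body. */
#define ex_for_each_feature(mask, bit)                      \
    for ((bit) = 0; (bit) < EX_FEATURE_COUNT; (bit)++)      \
        if ((mask) & (UINT64_C(1) << (bit)))

/* Store the indices of the set bits in out[] (at most max entries) and
 * return how many were stored. */
static inline int ex_collect_features(uint64_t mask, int *out, int max)
{
    int bit, n = 0;

    ex_for_each_feature(mask, bit) {
        if (n < max)
            out[n++] = bit;
    }
    return n;
}
```

For a mask of 0x8000 | 0x4000, the two feature values seen in the log
output below, the helper reports bits 14 and 15.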

[root@dell-per730-01 ~]# ethtool -K bond0 lro off
...
[  242.761394] bond0: Disabling feature 0x0000000000008000 on lower dev p5p2.
[  243.552178] bnx2x 0000:06:00.1 p5p2: using MSI-X  IRQs: sp 74  fp[0] 76 ... fp[7] 83
[  244.353978] bond0: Disabling feature 0x0000000000008000 on lower dev p5p1.
[  245.147420] bnx2x 0000:06:00.0 p5p1: using MSI-X  IRQs: sp 62  fp[0] 64 ... fp[7] 71

[root@dell-per730-01 ~]# ethtool -K bond0 gro off
...
[  251.925645] bond0: Disabling feature 0x0000000000004000 on lower dev p5p2.
[  252.713693] bnx2x 0000:06:00.1 p5p2: using MSI-X  IRQs: sp 74  fp[0] 76 ... fp[7] 83
[  253.499085] bond0: Disabling feature 0x0000000000004000 on lower dev p5p1.
[  254.290922] bnx2x 0000:06:00.0 p5p1: using MSI-X  IRQs: sp 62  fp[0] 64 ... fp[7] 71

Fixes: fd867d51f ("net/core: generic support for disabling netdev features down stack")
CC: "David S. Miller" <davem@davemloft.net>
CC: Eric Dumazet <edumazet@google.com>
CC: Jay Vosburgh <j.vosburgh@gmail.com>
CC: Veaceslav Falico <vfalico@gmail.com>
CC: Andy Gospodarek <gospo@cumulusnetworks.com>
CC: Jiri Pirko <jiri@resnulli.us>
CC: Nikolay Aleksandrov <razor@blackwall.org>
CC: Michal Kubecek <mkubecek@suse.cz>
CC: Alexander Duyck <alexander.duyck@gmail.com>
CC: Geert Uytterhoeven <geert@linux-m68k.org>
CC: netdev@vger.kernel.org
Signed-off-by: Jarod Wilson <jarod@redhat.com>
Acked-by: Nikolay Aleksandrov <nikolay@cumulusnetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-11-03 11:29:57 -05:00
Stefan Sørensen
5f94c943d5 ptp: Change ptp_class to a proper bitmask
Change the definition of PTP_CLASS_L2 to not have any bits overlapping with
the other defined protocol values, allowing the PTP_CLASS_* definitions to
be used for simple filtering on packet type.
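
A small sketch of why non-overlapping bits matter for filtering. The ex_
names and values are illustrative assumptions, not copied from
ptp_classify.h, and EX_PTP_L2_OVERLAP is a hypothetical encoding that
shares bits with the IP values.

```c
#include <stdbool.h>

/* Illustrative encoding: low nibble is the version, one bit per transport. */
#define EX_PTP_V1         0x01u
#define EX_PTP_V2         0x02u
#define EX_PTP_IPV4       0x10u
#define EX_PTP_IPV6       0x20u
#define EX_PTP_L2         0x40u  /* own bit: safe to test with a plain AND */
#define EX_PTP_L2_OVERLAP 0x30u  /* hypothetical: same bits as IPV4 | IPV6 */

/* Simple bitmask filter: true if any bit of proto is present in cls. */
static inline bool ex_ptp_is(unsigned int cls, unsigned int proto)
{
    return (cls & proto) != 0;
}
```

With the overlapping value, ex_ptp_is(EX_PTP_V2 | EX_PTP_L2_OVERLAP,
EX_PTP_IPV4) is true even though the packet is L2; with a dedicated bit the
same test is false.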

Signed-off-by: Stefan Sørensen <stefan.sorensen@spectralink.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-11-03 11:08:22 -05:00
Jarod Wilson
fd867d51f8 net/core: generic support for disabling netdev features down stack
There are some netdev features, which when disabled on an upper device,
such as a bonding master or a bridge, must be disabled and cannot be
re-enabled on underlying devices.

This is a rework of an earlier, more heavy-handed approach, which simply
disables and prevents re-enabling of netdev features listed in a new
define in include/linux/netdev_features.h, NETIF_F_UPPER_DISABLES. When an
upper device disables a flag in that feature mask, the disabling will
propagate down the stack, and any lower device that has any upper device
with one of those flags disabled should not be able to enable said flag.
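
The propagation rule can be modelled with a toy device tree. This is a
sketch under simplifying assumptions, not the kernel implementation: every
ex_ name is hypothetical, each device has at most one upper, and only the
disable direction is modelled.

```c
#include <stdint.h>
#include <stddef.h>

#define EX_F_LRO          (UINT64_C(1) << 15)
#define EX_UPPER_DISABLES EX_F_LRO  /* features that propagate downwards */

struct ex_dev {
    uint64_t features;
    struct ex_dev *upper;       /* NULL for a top-level device */
    struct ex_dev *lowers[4];
    size_t n_lowers;
};

/* Remove from wanted every propagating feature that some upper device
 * currently has disabled. */
static uint64_t ex_fix_features(const struct ex_dev *dev, uint64_t wanted)
{
    for (const struct ex_dev *u = dev->upper; u; u = u->upper)
        wanted &= ~(EX_UPPER_DISABLES & ~u->features);
    return wanted;
}

/* Apply a feature request, then re-evaluate every lower device so that a
 * newly disabled flag is pushed down the stack. */
static void ex_set_features(struct ex_dev *dev, uint64_t wanted)
{
    dev->features = ex_fix_features(dev, wanted);
    for (size_t i = 0; i < dev->n_lowers; i++)
        ex_set_features(dev->lowers[i], dev->lowers[i]->features);
}
```

Clearing EX_F_LRO on the upper device clears it on each lower device, and a
later request to set it on a lower device is filtered out while the upper
still has it off.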

Initially, only LRO is included for proof of concept, and because this
code effectively does the same thing as dev_disable_lro(), though it will
also activate from the ethtool path, which was one of the goals here.

[root@dell-per730-01 ~]# ethtool -k bond0 |grep large
large-receive-offload: on
[root@dell-per730-01 ~]# ethtool -k p5p1 |grep large
large-receive-offload: on
[root@dell-per730-01 ~]# ethtool -K bond0 lro off
[root@dell-per730-01 ~]# ethtool -k bond0 |grep large
large-receive-offload: off
[root@dell-per730-01 ~]# ethtool -k p5p1 |grep large
large-receive-offload: off

dmesg dump:

[ 1033.277986] bond0: Disabling feature 0x0000000000008000 on lower dev p5p2.
[ 1034.067949] bnx2x 0000:06:00.1 p5p2: using MSI-X  IRQs: sp 74  fp[0] 76 ... fp[7] 83
[ 1034.753612] bond0: Disabling feature 0x0000000000008000 on lower dev p5p1.
[ 1035.591019] bnx2x 0000:06:00.0 p5p1: using MSI-X  IRQs: sp 62  fp[0] 64 ... fp[7] 71

This has been successfully tested with bnx2x, qlcnic and netxen network
cards as slaves in a bond interface. Turning LRO on or off on the master
also turns it on or off on each of the slaves, new slaves are added with
LRO in the same state as the master, and LRO can't be toggled on the
slaves.

Also, this should largely remove the need for dev_disable_lro(), and most,
if not all, of its call sites can be replaced by simply making sure
NETIF_F_LRO isn't included in the relevant device's feature flags.

Note that this patch is driven by bug reports from users saying it was
confusing that bonds and slaves had different settings for the same
features, and while it won't be 100% in sync if a lower device doesn't
support a feature like LRO, I think this is a good step in the right
direction.

CC: "David S. Miller" <davem@davemloft.net>
CC: Eric Dumazet <edumazet@google.com>
CC: Jay Vosburgh <j.vosburgh@gmail.com>
CC: Veaceslav Falico <vfalico@gmail.com>
CC: Andy Gospodarek <gospo@cumulusnetworks.com>
CC: Jiri Pirko <jiri@resnulli.us>
CC: Nikolay Aleksandrov <razor@blackwall.org>
CC: Michal Kubecek <mkubecek@suse.cz>
CC: Alexander Duyck <alexander.duyck@gmail.com>
CC: netdev@vger.kernel.org
Signed-off-by: Jarod Wilson <jarod@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-11-02 23:41:31 -05:00
Eric Dumazet
9e17f8a475 net: make skb_set_owner_w() more robust
skb_set_owner_w() is called from various places that assume
skb->sk always points to a full blown socket (as it changes
sk->sk_wmem_alloc).

We'd like to attach skb to request sockets, and in the future
to timewait sockets as well. For these kinds of pseudo sockets,
we need to take a traditional refcount and use sock_edemux()
as the destructor.
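
The two ownership paths can be sketched with toy structures. Everything
prefixed ex_ is hypothetical and only mimics the shape of the idea: full
sockets are charged for the skb's memory, pseudo sockets are pinned with a
plain reference count and get a different destructor.

```c
#include <stdbool.h>

struct ex_sock {
    bool full;        /* false for request/timewait style pseudo sockets */
    int refcnt;
    int wmem_alloc;   /* write memory charged to a full socket */
};

struct ex_skb {
    struct ex_sock *sk;
    int truesize;
    void (*destructor)(struct ex_skb *skb);
};

/* Full socket: uncharge the write memory when the skb is freed. */
static void ex_sock_wfree(struct ex_skb *skb)
{
    skb->sk->wmem_alloc -= skb->truesize;
}

/* Pseudo socket: drop the reference taken when the skb was attached. */
static void ex_sock_edemux(struct ex_skb *skb)
{
    skb->sk->refcnt--;
}

static void ex_set_owner_w(struct ex_skb *skb, struct ex_sock *sk)
{
    skb->sk = sk;
    if (!sk->full) {
        /* No write-memory accounting here; just pin the socket. */
        sk->refcnt++;
        skb->destructor = ex_sock_edemux;
        return;
    }
    sk->wmem_alloc += skb->truesize;
    skb->destructor = ex_sock_wfree;
}
```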

skb_set_owner_w() is now too big to stay inline, so it is time to
un-inline it.

Fixes: ca6fb06518 ("tcp: attach SYNACK messages to request sockets instead of listener")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Bisected-by: Haiyang Zhang <haiyangz@microsoft.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-11-02 16:28:49 -05:00
Hannes Frederic Sowa
080a270f5a sock: don't enable netstamp for af_unix sockets
netstamp_needed is toggled for all socket families if they request
timestamping. But some protocols don't need the lower-layer timestamping
code at all. This patch starts disabling it for af-unix.

E.g. systemd enables timestamping during boot-up on the journald af-unix
sockets, thus causing the system to globally enable timestamping in the
lower networking stack. Still, it is very probable that timestamping
gets activated, by e.g. dhclient or various NTP implementations.
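
The idea can be sketched as a per-family check in front of the global
counter. This is a toy model with hypothetical ex_ names, not the kernel
code; the family constants are local to the example.

```c
#include <stdbool.h>

/* Local stand-ins for address families, for illustration only. */
enum ex_family { EX_AF_UNIX, EX_AF_INET, EX_AF_INET6 };

/* Models the global "someone needs lower-layer timestamps" counter. */
static int ex_netstamp_needed;

/* Families that never see packets from the lower networking stack do not
 * need the global timestamping code enabled. */
static bool ex_family_needs_netstamp(enum ex_family family)
{
    return family != EX_AF_UNIX;
}

/* Called when a socket requests timestamping. */
static void ex_enable_timestamp(enum ex_family family)
{
    if (ex_family_needs_netstamp(family))
        ex_netstamp_needed++;
}
```

A timestamping request on an EX_AF_UNIX socket leaves the counter
untouched, while the same request on an EX_AF_INET socket increments it.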

Reported-by: Jesper Dangaard Brouer <brouer@redhat.com>
Signed-off-by: Hannes Frederic Sowa <hannes@stressinduktion.org>
Acked-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-10-27 19:39:14 -07:00
emmanuel.grumbach@intel.com
8941faa161 net: tso: add support for IPv6
Adding IPv6 for the TSO helper API is trivial:
* Don't play with the id (which doesn't exist in IPv6)
* Correctly update the payload_len (don't include the
  length of the IP header itself)
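
The length difference between the two header formats can be shown with two
small helpers. This is a sketch with hypothetical ex_ names; values are in
host byte order (real code would convert with htons()) and IPv6 extension
headers are assumed to be absent.

```c
#include <stdint.h>

/* IPv4: the total length field covers the IP header, the TCP header and
 * the segment payload. */
static inline uint16_t ex_ipv4_tot_len(uint16_t ip_hdr_len,
                                       uint16_t tcp_hdr_len,
                                       uint16_t payload)
{
    return (uint16_t)(ip_hdr_len + tcp_hdr_len + payload);
}

/* IPv6: the payload length field excludes the fixed IPv6 header, so only
 * the TCP header and the segment payload are counted. */
static inline uint16_t ex_ipv6_payload_len(uint16_t tcp_hdr_len,
                                           uint16_t payload)
{
    return (uint16_t)(tcp_hdr_len + payload);
}
```

For a 1448-byte segment with 20-byte IP and TCP headers, the IPv4 field is
1488 while the IPv6 field is 1468.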

Signed-off-by: Emmanuel Grumbach <emmanuel.grumbach@intel.com>
Acked-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-10-26 22:24:22 -07:00
David S. Miller
ba3e2084f2 Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net
Conflicts:
	net/ipv6/xfrm6_output.c
	net/openvswitch/flow_netlink.c
	net/openvswitch/vport-gre.c
	net/openvswitch/vport-vxlan.c
	net/openvswitch/vport.c
	net/openvswitch/vport.h

The openvswitch conflicts were overlapping changes.  One was
the egress tunnel info fix in 'net' and the other was the
vport ->send() op simplification in 'net-next'.

The xfrm6_output.c conflict was also a simplification
overlapping a bug fix.

Signed-off-by: David S. Miller <davem@davemloft.net>
2015-10-24 06:54:12 -07:00