Commit graph

1430 commits

Author SHA1 Message Date
David Keitel
f2b1fed1bd Merge remote-tracking branch 'lsk-44/linux-linaro-lsk-v4.4' into 44rc2
* lsk-44/linux-linaro-lsk-v4.4:
  Linux 4.4.3
  modules: fix modparam async_probe request
  module: wrapper for symbol name.
  itimers: Handle relative timers with CONFIG_TIME_LOW_RES proper
  posix-timers: Handle relative timers with CONFIG_TIME_LOW_RES proper
  timerfd: Handle relative timers with CONFIG_TIME_LOW_RES proper
  prctl: take mmap sem for writing to protect against others
  xfs: log mount failures don't wait for buffers to be released
  Revert "xfs: clear PF_NOFREEZE for xfsaild kthread"
  xfs: inode recovery readahead can race with inode buffer creation
  libxfs: pack the agfl header structure so XFS_AGFL_SIZE is correct
  ovl: setattr: check permissions before copy-up
  ovl: root: copy attr
  ovl: check dentry positiveness in ovl_cleanup_whiteouts()
  ovl: use a minimal buffer in ovl_copy_xattr
  ovl: allow zero size xattr
  futex: Drop refcount if requeue_pi() acquired the rtmutex
  devm_memremap_release(): fix memremap'd addr handling
  ipc/shm: handle removed segments gracefully in shm_mmap()
  intel_scu_ipcutil: underflow in scu_reg_access()
  mm,thp: khugepaged: call pte flush at the time of collapse
  dump_stack: avoid potential deadlocks
  radix-tree: fix oops after radix_tree_iter_retry
  drivers/hwspinlock: fix race between radix tree insertion and lookup
  radix-tree: fix race in gang lookup
  MAINTAINERS: return arch/sh to maintained state, with new maintainers
  memcg: only free spare array when readers are done
  numa: fix /proc/<pid>/numa_maps for hugetlbfs on s390
  fs/hugetlbfs/inode.c: fix bugs in hugetlb_vmtruncate_list()
  scripts/bloat-o-meter: fix python3 syntax error
  dma-debug: switch check from _text to _stext
  m32r: fix m32104ut_defconfig build fail
  xhci: Fix list corruption in urb dequeue at host removal
  Revert "xhci: don't finish a TD if we get a short-transfer event mid TD"
  iommu/vt-d: Clear PPR bit to ensure we get more page request interrupts
  iommu/vt-d: Fix 64-bit accesses to 32-bit DMAR_GSTS_REG
  iommu/vt-d: Fix mm refcounting to hold mm_count not mm_users
  iommu/amd: Correct the wrong setting of alias DTE in do_attach
  iommu/vt-d: Don't skip PCI devices when disabling IOTLB
  Input: vmmouse - fix absolute device registration
  string_helpers: fix precision loss for some inputs
  Input: i8042 - add Fujitsu Lifebook U745 to the nomux list
  Input: elantech - mark protocols v2 and v3 as semi-mt
  mm: fix regression in remap_file_pages() emulation
  mm: replace vma_lock_anon_vma with anon_vma_lock_read/write
  mm: fix mlock accouting
  libnvdimm: fix namespace object confusion in is_uuid_busy()
  mm: soft-offline: check return value in second __get_any_page() call
  perf kvm record/report: 'unprocessable sample' error while recording/reporting guest data
  KVM: PPC: Fix ONE_REG AltiVec support
  KVM: PPC: Fix emulation of H_SET_DABR/X on POWER8
  KVM: arm/arm64: Fix reference to uninitialised VGIC
  arm64: dma-mapping: fix handling of devices registered before arch_initcall
  ARM: OMAP2+: Fix ppa_zero_params and ppa_por_params for rodata
  ARM: OMAP2+: Fix save_secure_ram_context for rodata
  ARM: OMAP2+: Fix l2dis_3630 for rodata
  ARM: OMAP2+: Fix l2_inv_api_params for rodata
  ARM: OMAP2+: Fix wait_dll_lock_timed for rodata
  ARM: dts: at91: sama5d4ek: add phy address and IRQ for macb0
  ARM: dts: at91: sama5d4 xplained: fix phy0 IRQ type
  ARM: dts: at91: sama5d4: fix instance id of DBGU
  ARM: dts: at91: sama5d4 xplained: properly mux phy interrupt
  ARM: dts: omap5-board-common: enable rtc and charging of backup battery
  ARM: dts: Fix omap5 PMIC control lines for RTC writes
  ARM: dts: Fix wl12xx missing clocks that cause hangs
  ARM: nomadik: fix up SD/MMC DT settings
  ARM: 8517/1: ICST: avoid arithmetic overflow in icst_hz()
  ARM: 8519/1: ICST: try other dividends than 1
  arm64: mm: avoid calling apply_to_page_range on empty range
  ARM: mvebu: remove duplicated regulator definition in Armada 388 GP
  powerpc/ioda: Set "read" permission when "write" is set
  powerpc/powernv: Fix stale PE primary bus
  powerpc/eeh: Fix stale cached primary bus
  powerpc/eeh: Fix PE location code
  SUNRPC: Fixup socket wait for memory
  udf: Check output buffer length when converting name to CS0
  udf: Prevent buffer overrun with multi-byte characters
  udf: limit the maximum number of indirect extents in a row
  pNFS/flexfiles: Fix an XDR encoding bug in layoutreturn
  nfs: Fix race in __update_open_stateid()
  pNFS/flexfiles: Fix an Oopsable typo in ff_mirror_match_fh()
  NFS: Fix attribute cache revalidation
  cifs: fix erroneous return value
  cifs_dbg() outputs an uninitialized buffer in cifs_readdir()
  cifs: fix race between call_async() and reconnect()
  cifs: Ratelimit kernel log messages
  iio: inkern: fix a NULL dereference on error
  iio: pressure: mpl115: fix temperature offset sign
  iio: light: acpi-als: Report data as processed
  iio: dac: mcp4725: set iio name property in sysfs
  iio: add IIO_TRIGGER dependency to STK8BA50
  iio: add HAS_IOMEM dependency to VF610_ADC
  iio-light: Use a signed return type for ltr501_match_samp_freq()
  iio:adc:ti_am335x_adc Fix buffered mode by identifying as software buffer.
  iio: adis_buffer: Fix out-of-bounds memory access
  scsi: fix soft lockup in scsi_remove_target() on module removal
  SCSI: Add Marvell Console to VPD blacklist
  scsi_dh_rdac: always retry MODE SELECT on command lock violation
  drivers/scsi/sg.c: mark VMA as VM_IO to prevent migration
  SCSI: fix crashes in sd and sr runtime PM
  iscsi-target: Fix potential dead-lock during node acl delete
  scsi: add Synology to 1024 sector blacklist
  klist: fix starting point removed bug in klist iterators
  tracepoints: Do not trace when cpu is offline
  tracing: Fix freak link error caused by branch tracer
  perf tools: tracepoint_error() can receive e=NULL, robustify it
  tools lib traceevent: Fix output of %llu for 64 bit values read on 32 bit machines
  ptrace: use fsuid, fsgid, effective creds for fs access checks
  Btrfs: fix direct IO requests not reporting IO error to user space
  Btrfs: fix hang on extent buffer lock caused by the inode_paths ioctl
  Btrfs: fix page reading in extent_same ioctl leading to csum errors
  Btrfs: fix invalid page accesses in extent_same (dedup) ioctl
  btrfs: properly set the termination value of ctx->pos in readdir
  Revert "btrfs: clear PF_NOFREEZE in cleaner_kthread()"
  Btrfs: fix fitrim discarding device area reserved for boot loader's use
  btrfs: handle invalid num_stripes in sys_array
  ext4: don't read blocks from disk after extents being swapped
  ext4: fix potential integer overflow
  ext4: fix scheduling in atomic on group checksum failure
  serial: omap: Prevent DoS using unprivileged ioctl(TIOCSRS485)
  serial: 8250_pci: Add Intel Broadwell ports
  tty: Add support for PCIe WCH382 2S multi-IO card
  pty: make sure super_block is still valid in final /dev/tty close
  pty: fix possible use after free of tty->driver_data
  staging/speakup: Use tty_ldisc_ref() for paste kworker
  phy: twl4030-usb: Fix unbalanced pm_runtime_enable on module reload
  phy: twl4030-usb: Relase usb phy on unload
  ALSA: seq: Fix double port list deletion
  ALSA: seq: Fix leak of pool buffer at concurrent writes
  ALSA: pcm: Fix rwsem deadlock for non-atomic PCM stream
  ALSA: hda - Cancel probe work instead of flush at remove
  x86/mm: Fix vmalloc_fault() to handle large pages properly
  x86/uaccess/64: Handle the caching of 4-byte nocache copies properly in __copy_user_nocache()
  x86/uaccess/64: Make the __copy_user_nocache() assembly code more readable
  x86/mm/pat: Avoid truncation when converting cpa->numpages to address
  x86/mm: Fix types used in pgprot cacheability flags translations
  Linux 4.4.2
  HID: multitouch: fix input mode switching on some Elan panels
  mm, vmstat: fix wrong WQ sleep when memory reclaim doesn't make any progress
  zsmalloc: fix migrate_zspage-zs_free race condition
  zram: don't call idr_remove() from zram_remove()
  zram: try vmalloc() after kmalloc()
  zram/zcomp: use GFP_NOIO to allocate streams
  rtlwifi: rtl8821ae: Fix 5G failure when EEPROM is incorrectly encoded
  rtlwifi: rtl8821ae: Fix errors in parameter initialization
  crypto: marvell/cesa - fix test in mv_cesa_dev_dma_init()
  crypto: atmel-sha - remove calls of clk_prepare() from atomic contexts
  crypto: atmel-sha - fix atmel_sha_remove()
  crypto: algif_skcipher - Do not set MAY_BACKLOG on the async path
  crypto: algif_skcipher - Do not dereference ctx without socket lock
  crypto: algif_skcipher - Do not assume that req is unchanged
  crypto: user - lock crypto_alg_list on alg dump
  EVM: Use crypto_memneq() for digest comparisons
  crypto: algif_hash - wait for crypto_ahash_init() to complete
  crypto: shash - Fix has_key setting
  crypto: chacha20-ssse3 - Align stack pointer to 64 bytes
  crypto: caam - make write transactions bufferable on PPC platforms
  crypto: algif_skcipher - sendmsg SG marking is off by one
  crypto: algif_skcipher - Load TX SG list after waiting
  crypto: crc32c - Fix crc32c soft dependency
  crypto: algif_skcipher - Fix race condition in skcipher_check_key
  crypto: algif_hash - Fix race condition in hash_check_key
  crypto: af_alg - Forbid bind(2) when nokey child sockets are present
  crypto: algif_skcipher - Remove custom release parent function
  crypto: algif_hash - Remove custom release parent function
  crypto: af_alg - Allow af_af_alg_release_parent to be called on nokey path
  ahci: Intel DNV device IDs SATA
  libata: disable forced PORTS_IMPL for >= AHCI 1.3
  crypto: algif_skcipher - Add key check exception for cipher_null
  crypto: skcipher - Add crypto_skcipher_has_setkey
  crypto: algif_hash - Require setkey before accept(2)
  crypto: hash - Add crypto_ahash_has_setkey
  crypto: algif_skcipher - Add nokey compatibility path
  crypto: af_alg - Add nokey compatibility path
  crypto: af_alg - Fix socket double-free when accept fails
  crypto: af_alg - Disallow bind/setkey/... after accept(2)
  crypto: algif_skcipher - Require setkey before accept(2)
  sched: Fix crash in sched_init_numa()
  ext4 crypto: add missing locking for keyring_key access
  iommu/io-pgtable-arm: Ensure we free the final level on teardown
  tty: Fix unsafe ldisc reference via ioctl(TIOCGETD)
  tty: Retry failed reopen if tty teardown in-progress
  tty: Wait interruptibly for tty lock on reopen
  n_tty: Fix unsafe reference to "other" ldisc
  usb: xhci: apply XHCI_PME_STUCK_QUIRK to Intel Broxton-M platforms
  usb: xhci: handle both SSIC ports in PME stuck quirk
  usb: phy: msm: fix error handling in probe.
  usb: cdc-acm: send zero packet for intel 7260 modem
  usb: cdc-acm: handle unlinked urb in acm read callback
  USB: option: fix Cinterion AHxx enumeration
  USB: serial: option: Adding support for Telit LE922
  USB: cp210x: add ID for IAI USB to RS485 adaptor
  USB: serial: ftdi_sio: add support for Yaesu SCU-18 cable
  usb: hub: do not clear BOS field during reset device
  USB: visor: fix null-deref at probe
  USB: serial: visor: fix crash on detecting device without write_urbs
  ASoC: rt5645: fix the shift bit of IN1 boost
  saa7134-alsa: Only frees registered sound cards
  ALSA: dummy: Implement timer backend switching more safely
  ALSA: hda - Fix bad dereference of jack object
  ALSA: hda - Fix speaker output from VAIO AiO machines
  Revert "ALSA: hda - Fix noise on Gigabyte Z170X mobo"
  ALSA: hda - Fix static checker warning in patch_hdmi.c
  ALSA: hda - Add fixup for Mac Mini 7,1 model
  ALSA: timer: Fix race between stop and interrupt
  ALSA: timer: Fix wrong instance passed to slave callbacks
  ALSA: timer: Fix race at concurrent reads
  ALSA: timer: Fix link corruption due to double start or stop
  ALSA: timer: Fix leftover link at closing
  ALSA: timer: Code cleanup
  ALSA: seq: Fix lockdep warnings due to double mutex locks
  ALSA: seq: Fix race at closing in virmidi driver
  ALSA: seq: Fix yet another races among ALSA timer accesses
  ASoC: dpcm: fix the BE state on hw_free
  ALSA: pcm: Fix potential deadlock in OSS emulation
  ALSA: hda/realtek - Support Dell headset mode for ALC225
  ALSA: hda/realtek - Support headset mode for ALC225
  ALSA: hda/realtek - New codec support of ALC225
  ALSA: rawmidi: Fix race at copying & updating the position
  ALSA: rawmidi: Remove kernel WARNING for NULL user-space buffer check
  ALSA: rawmidi: Make snd_rawmidi_transmit() race-free
  ALSA: seq: Degrade the error message for too many opens
  ALSA: seq: Fix incorrect sanity check at snd_seq_oss_synth_cleanup()
  ALSA: dummy: Disable switching timer backend via sysfs
  ALSA: compress: Disable GET_CODEC_CAPS ioctl for some architectures
  ALSA: hda - disable dynamic clock gating on Broxton before reset
  ALSA: Add missing dependency on CONFIG_SND_TIMER
  ALSA: bebob: Use a signed return type for get_formation_index
  ALSA: usb-audio: avoid freeing umidi object twice
  ALSA: usb-audio: Add native DSD support for PS Audio NuWave DAC
  ALSA: usb-audio: Fix OPPO HA-1 vendor ID
  ALSA: usb-audio: Add quirk for Microsoft LifeCam HD-6000
  ALSA: usb-audio: Fix TEAC UD-501/UD-503/NT-503 usb delay
  hrtimer: Handle remaining time proper for TIME_LOW_RES
  md/raid: only permit hot-add of compatible integrity profiles
  media: i2c: Don't export ir-kbd-i2c module alias
  parisc: Fix __ARCH_SI_PREAMBLE_SIZE
  parisc: Protect huge page pte changes with spinlocks
  printk: do cond_resched() between lines while outputting to consoles
  tracing/stacktrace: Show entire trace if passed in function not found
  tracing: Fix stacktrace skip depth in trace_buffer_unlock_commit_regs()
  PCI: Fix minimum allocation address overwrite
  PCI: host: Mark PCIe/PCI (MSI) IRQ cascade handlers as IRQF_NO_THREAD
  mtd: nand: assign reasonable default name for NAND drivers
  wlcore/wl12xx: spi: fix NULL pointer dereference (Oops)
  wlcore/wl12xx: spi: fix oops on firmware load
  ocfs2/dlm: clear refmap bit of recovery lock while doing local recovery cleanup
  ocfs2/dlm: ignore cleaning the migration mle that is inuse
  ALSA: hda - Implement loopback control switch for Realtek and other codecs
  block: fix bio splitting on max sectors
  base/platform: Fix platform drivers with no probe callback
  HID: usbhid: fix recursive deadlock
  ocfs2: NFS hangs in __ocfs2_cluster_lock due to race with ocfs2_unblock_lock
  block: split bios to max possible length
  NFSv4.1/pnfs: Fixup an lo->plh_block_lgets imbalance in layoutreturn
  crypto: sun4i-ss - add missing statesize
  Linux 4.4.1
  arm64: kernel: fix architected PMU registers unconditional access
  arm64: kernel: enforce pmuserenr_el0 initialization and restore
  arm64: mm: ensure that the zero page is visible to the page table walker
  arm64: Clear out any singlestep state on a ptrace detach operation
  powerpc/module: Handle R_PPC64_ENTRY relocations
  scripts/recordmcount.pl: support data in text section on powerpc
  powerpc: Make {cmp}xchg* and their atomic_ versions fully ordered
  powerpc: Make value-returning atomics fully ordered
  powerpc/tm: Check for already reclaimed tasks
  batman-adv: Drop immediate orig_node free function
  batman-adv: Drop immediate batadv_hard_iface free function
  batman-adv: Drop immediate neigh_ifinfo free function
  batman-adv: Drop immediate batadv_neigh_node free function
  batman-adv: Drop immediate batadv_orig_ifinfo free function
  batman-adv: Avoid recursive call_rcu for batadv_nc_node
  batman-adv: Avoid recursive call_rcu for batadv_bla_claim
  team: Replace rcu_read_lock with a mutex in team_vlan_rx_kill_vid
  net/mlx5_core: Fix trimming down IRQ number
  bridge: fix lockdep addr_list_lock false positive splat
  ipv6: update skb->csum when CE mark is propagated
  net: bpf: reject invalid shifts
  phonet: properly unshare skbs in phonet_rcv()
  dwc_eth_qos: Fix dma address for multi-fragment skbs
  bonding: Prevent IPv6 link local address on enslaved devices
  net: preserve IP control block during GSO segmentation
  udp: disallow UFO for sockets with SO_NO_CHECK option
  net: pktgen: fix null ptr deref in skb allocation
  sched,cls_flower: set key address type when present
  tcp_yeah: don't set ssthresh below 2
  ipv6: tcp: add rcu locking in tcp_v6_send_synack()
  net: sctp: prevent writes to cookie_hmac_alg from accessing invalid memory
  vxlan: fix test which detect duplicate vxlan iface
  unix: properly account for FDs passed over unix sockets
  xhci: refuse loading if nousb is used
  usb: core: lpm: fix usb3_hardware_lpm sysfs node
  USB: cp210x: add ID for ELV Marble Sound Board 1
  rtlwifi: fix memory leak for USB device
  ASoC: compress: Fix compress device direction check
  ASoC: wm5110: Fix PGA clear when disabling DRE
  ALSA: timer: Handle disconnection more safely
  ALSA: hda - Flush the pending probe work at remove
  ALSA: hda - Fix missing module loading with model=generic option
  ALSA: hda - Fix bass pin fixup for ASUS N550JX
  ALSA: control: Avoid kernel warnings from tlv ioctl with numid 0
  ALSA: hrtimer: Fix stall by hrtimer_cancel()
  ALSA: pcm: Fix snd_pcm_hw_params struct copy in compat mode
  ALSA: seq: Fix snd_seq_call_port_info_ioctl in compat mode
  ALSA: hda - Add fixup for Dell Latitidue E6540
  ALSA: timer: Fix double unlink of active_list
  ALSA: timer: Fix race among timer ioctls
  ALSA: hda - fix the headset mic detection problem for a Dell laptop
  ALSA: timer: Harden slave timer list handling
  ALSA: usb-audio: Fix mixer ctl regression of Native Instrument devices
  ALSA: hda - Fix white noise on Dell Latitude E5550
  ALSA: seq: Fix race at timer setup and close
  ALSA: usb-audio: Avoid calling usb_autopm_put_interface() at disconnect
  ALSA: seq: Fix missing NULL check at remove_events ioctl
  ALSA: hda - Fixup inverted internal mic for Lenovo E50-80
  ALSA: usb: Add native DSD support for Oppo HA-1
  x86/mm: Improve switch_mm() barrier comments
  x86/mm: Add barriers and document switch_mm()-vs-flush synchronization
  x86/boot: Double BOOT_HEAP_SIZE to 64KB
  x86/reboot/quirks: Add iMac10,1 to pci_reboot_dmi_table[]
  kvm: x86: Fix vmwrite to SECONDARY_VM_EXEC_CONTROL
  KVM: x86: correctly print #AC in traces
  KVM: x86: expose MSR_TSC_AUX to userspace
  x86/xen: don't reset vcpu_info on a cancelled suspend
  KEYS: Fix keyring ref leak in join_session_keyring()

Conflicts:
	arch/arm64/kernel/perf_event.c
	drivers/scsi/sd.c
	sound/core/compress_offload.c

Change-Id: I9f77fe42aaae249c24cd6e170202110ab1426878
Signed-off-by: Trilok Soni <tsoni@codeaurora.org>
2016-03-23 20:51:00 -07:00
Harout Hedeshian
d5f660635e net: sched: gracefully handle qdisc which don't support change()
tc_qdisc_flow_control() incorrectly assumes that all queue disciplines
implement the change() callback in Qdisc_ops. If the flow control function
is called a queue that does not implement change(), we try to jump to a
null function pointer resulting in a panic. The flow control function
does require that the queue discipline support change() in order to
enable/disable the queue. Now we will check if the queue to do flow
control on supports the required funcionality and will WARN_ONCE and
return if the requirements are not met.

[<ffffffc000c5c028>] panic+0x28c
[<ffffffc0000889f4>] die+0x238
[<ffffffc000088a6c>] arm64_notify_die+0x4c
[<ffffffc000088d00>] bad_mode+0xa8
[<ffffffc000ab9ab8>] tc_qdisc_flow_control+0x98
[<ffffffc000c506e8>] rmnet_vnd_ioctl+0x230
[<ffffffc000aac6fc>] dev_ifsioc+0x2a0
[<ffffffc000aacde8>] dev_ioctl+0x644
[<ffffffc000a7db4c>] sock_ioctl+0x4c
[<ffffffc0001af548>] do_vfs_ioctl+0x4b4
[<ffffffc0001af684>] SyS_ioctl+0x60
[<ffffffc0000854b0>] el0_svc_naked+0x24

Change-Id: I85ca324e57814b6c0eac48d2e076e861ab32c260
Signed-off-by: Harout Hedeshian <harouth@codeaurora.org>
2016-03-22 11:09:47 -07:00
Tianyi Gou
12fae71b18 net: sched: Schedule PRIO qdisc when flow control released
The PRIO qdisc supports flow control, such that packet
dequeue can be disabled based on boolean flag 'enable_flow'.
When flow is re-enabled, the latency for new packets
arriving at network driver is high.  To reduce the delay in
scheduling packets, the qdisc will now invoke
__netif_schedule() to expedite dequeue.  This significantly
reduces the latency of packets arriving at network driver.

Change-Id: Ic5fe3faf86f177300d3018b9f60974ba3811641c
CRs-Fixed: 355156
Acked-by: Jimi Shah <jimis@qualcomm.com>
Signed-off-by: Tianyi Gou <tgou@codeaurora.org>
2016-03-22 11:05:55 -07:00
Harout Hedeshian
6ff0325a57 net: tc_qdisc_flow_control returning qdisc size
Changed the tc_qdisc_flow_control API to return the size of the
qdisc in order to be able to collect data on the size of the
qdisc before doing flow control operations. This is required to effectively
diagnose the state of the queues when debugging flow control.

CRs-Fixed: 657414
Change-Id: I4aff2157410e1170de2d0791757ed2e12830a2db
Acked-by: Sivan Reinstein <sivanr@qti.qualcomm.com>
Signed-off-by: Harout Hedeshian <harouth@codeaurora.org>
[subashab@codeaurora.org: resolve trivial merge conflicts]
Signed-off-by: Subash Abhinov Kasiviswanathan <subashab@codeaurora.org>
2016-03-22 11:05:54 -07:00
Tianyi Gou
a4dc732269 net_sched: Add flow control support to prio qdisc
Add enable_flow flag to the prio qdisc.  Packet flow is
enabled by default, but can be disabled from userspace
(e.g. IPROUTE2 tc tool).  This allows for suspending packet
dequeue on a per-qdisc basis, which is needed to supprot
Quality of Service (QOS) when using WWAN modem.

Change-Id: I932f296be946f1acc3b00c7d8569bbb733d33622
Acked-by: Andrew Richardson <randrew@qualcomm.com>
CRs-Fixed: 283471
Signed-off-by: Tianyi Gou <tgou@codeaurora.org>
2016-03-22 11:05:53 -07:00
Tianyi Gou
84b4c1cc42 net: sched: export an api to enable/disable flow on sch
Export a function from sch_api.c that will look up
desired qdisc and call it's registered change function
to enable/disable flow.

Change-Id: I5b6dc7a6fd2b09b796c92b3770ba83423d19c864
CRs-Fixed: 355156
Acked-by: Jimi Shah <jimis@qualcomm.com>
Signed-off-by: Tianyi Gou <tgou@codeaurora.org>
[subashab@codeaurora.org: resolve trivial merge conflicts]
Signed-off-by: Subash Abhinov Kasiviswanathan <subashab@codeaurora.org>
2016-03-22 11:05:53 -07:00
Jamal Hadi Salim
8ece807074 sched,cls_flower: set key address type when present
[ Upstream commit 66530bdf85eb1d72a0c399665e09a2c2298501c6 ]

only when user space passes the addresses should we consider their
presence

Signed-off-by: Jamal Hadi Salim <jhs@mojatatu.com>
Acked-by: Jiri Pirko <jiri@resnulli.us>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2016-01-31 11:28:59 -08:00
John Fastabend
73c20a8b72 net: sched: fix missing free per cpu on qstats
When a qdisc is using per cpu stats (currently just the ingress
qdisc) only the bstats are being freed. This also free's the qstats.

Fixes: b0ab6f9275 ("net: sched: enable per cpu qstats")
Signed-off-by: John Fastabend <john.r.fastabend@intel.com>
Acked-by: Eric Dumazet <edumazet@google.com>
Acked-by: Daniel Borkmann <daniel@iogearbox.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-01-06 01:40:21 -05:00
Eric Dumazet
225734de70 net_sched: make qdisc_tree_decrease_qlen() work for non mq
Stas Nichiporovich reported a regression in his HFSC qdisc setup
on a non multi queue device.

It turns out I mistakenly added a TCQ_F_NOPARENT flag on all qdisc
allocated in qdisc_create() for non multi queue devices, which was
rather buggy. I was clearly mislead by the TCQ_F_ONETXQUEUE that is
also set here for no good reason, since it only matters for the root
qdisc.

Fixes: 4eaf3b84f2 ("net_sched: fix qdisc_tree_decrease_qlen() races")
Reported-by: Stas Nichiporovich <stasn77@gmail.com>
Tested-by: Stas Nichiporovich <stasn77@gmail.com>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-12-15 13:41:52 -05:00
Eric Dumazet
4eaf3b84f2 net_sched: fix qdisc_tree_decrease_qlen() races
qdisc_tree_decrease_qlen() suffers from two problems on multiqueue
devices.

One problem is that it updates sch->q.qlen and sch->qstats.drops
on the mq/mqprio root qdisc, while it should not : Daniele
reported underflows errors :
[  681.774821] PAX: sch->q.qlen: 0 n: 1
[  681.774825] PAX: size overflow detected in function qdisc_tree_decrease_qlen net/sched/sch_api.c:769 cicus.693_49 min, count: 72, decl: qlen; num: 0; context: sk_buff_head;
[  681.774954] CPU: 2 PID: 19 Comm: ksoftirqd/2 Tainted: G           O    4.2.6.201511282239-1-grsec #1
[  681.774955] Hardware name: ASUSTeK COMPUTER INC. X302LJ/X302LJ, BIOS X302LJ.202 03/05/2015
[  681.774956]  ffffffffa9a04863 0000000000000000 0000000000000000 ffffffffa990ff7c
[  681.774959]  ffffc90000d3bc38 ffffffffa95d2810 0000000000000007 ffffffffa991002b
[  681.774960]  ffffc90000d3bc68 ffffffffa91a44f4 0000000000000001 0000000000000001
[  681.774962] Call Trace:
[  681.774967]  [<ffffffffa95d2810>] dump_stack+0x4c/0x7f
[  681.774970]  [<ffffffffa91a44f4>] report_size_overflow+0x34/0x50
[  681.774972]  [<ffffffffa94d17e2>] qdisc_tree_decrease_qlen+0x152/0x160
[  681.774976]  [<ffffffffc02694b1>] fq_codel_dequeue+0x7b1/0x820 [sch_fq_codel]
[  681.774978]  [<ffffffffc02680a0>] ? qdisc_peek_dequeued+0xa0/0xa0 [sch_fq_codel]
[  681.774980]  [<ffffffffa94cd92d>] __qdisc_run+0x4d/0x1d0
[  681.774983]  [<ffffffffa949b2b2>] net_tx_action+0xc2/0x160
[  681.774985]  [<ffffffffa90664c1>] __do_softirq+0xf1/0x200
[  681.774987]  [<ffffffffa90665ee>] run_ksoftirqd+0x1e/0x30
[  681.774989]  [<ffffffffa90896b0>] smpboot_thread_fn+0x150/0x260
[  681.774991]  [<ffffffffa9089560>] ? sort_range+0x40/0x40
[  681.774992]  [<ffffffffa9085fe4>] kthread+0xe4/0x100
[  681.774994]  [<ffffffffa9085f00>] ? kthread_worker_fn+0x170/0x170
[  681.774995]  [<ffffffffa95d8d1e>] ret_from_fork+0x3e/0x70

mq/mqprio have their own ways to report qlen/drops by folding stats on
all their queues, with appropriate locking.

A second problem is that qdisc_tree_decrease_qlen() calls qdisc_lookup()
without proper locking : concurrent qdisc updates could corrupt the list
that qdisc_match_from_root() parses to find a qdisc given its handle.

Fix first problem adding a TCQ_F_NOPARENT qdisc flag that
qdisc_tree_decrease_qlen() can use to abort its tree traversal,
as soon as it meets a mq/mqprio qdisc children.

Second problem can be fixed by RCU protection.
Qdisc are already freed after RCU grace period, so qdisc_list_add() and
qdisc_list_del() simply have to use appropriate rcu list variants.

A future patch will add a per struct netdev_queue list anchor, so that
qdisc_tree_decrease_qlen() can have more efficient lookups.

Reported-by: Daniele Fucini <dfucini@gmail.com>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Cong Wang <cwang@twopensource.com>
Cc: Jamal Hadi Salim <jhs@mojatatu.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-12-03 14:59:05 -05:00
Eric Dumazet
02a56c81cf net_sched: em_meta: use skb_to_full_sk() helper
SYNACK packets might be attached to request sockets.

Fixes: ca6fb06518 ("tcp: attach SYNACK messages to request sockets instead of listener")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-11-08 20:56:39 -05:00
Eric Dumazet
743b2a6674 sched: cls_flow: use skb_to_full_sk() helper
SYNACK packets might be attached to request sockets.

Fixes: ca6fb06518 ("tcp: attach SYNACK messages to request sockets instead of listener")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-11-08 20:56:39 -05:00
Phil Sutter
2a4f417621 net: sched: kill dead code in sch_choke.c
It looks like this has never been used at all.

Signed-off-by: Phil Sutter <phil@nwl.cc>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-11-03 13:30:47 -05:00
David S. Miller
26440c835f Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net
Conflicts:
	drivers/net/usb/asix_common.c
	net/ipv4/inet_connection_sock.c
	net/switchdev/switchdev.c

In the inet_connection_sock.c case the request socket hashing scheme
is completely different in net-next.

The other two conflicts were overlapping changes.

Signed-off-by: David S. Miller <davem@davemloft.net>
2015-10-20 06:08:27 -07:00
Eric Dumazet
e446f9dfe1 net: synack packets can be attached to request sockets
selinux needs few changes to accommodate fact that SYNACK messages
can be attached to a request socket, lacking sk_security pointer

(Only syncookies are still attached to a TCP_LISTEN socket)

Adds a new sk_listener() helper, and use it in selinux and sch_fq

Fixes: ca6fb06518 ("tcp: attach SYNACK messages to request sockets instead of listener")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Reported by: kernel test robot <ying.huang@linux.intel.com>
Cc: Paul Moore <paul@paul-moore.com>
Cc: Stephen Smalley <sds@tycho.nsa.gov>
Cc: Eric Paris <eparis@parisplace.org>
Acked-by: Paul Moore <paul@paul-moore.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-10-11 05:05:06 -07:00
WANG Cong
6ac644a8ae sch_hhf: fix return value of hhf_drop()
Similar to commit c0afd9ce4d ("fq_codel: fix return value of fq_codel_drop()")
->drop() is supposed to return the number of bytes it dropped,
but hhf_drop () returns the id of the bucket where it drops
a packet from.

Cc: Jamal Hadi Salim <jhs@mojatatu.com>
Cc: Terry Lam <vtlam@google.com>
Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com>
Signed-off-by: Cong Wang <cwang@twopensource.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-10-11 04:49:33 -07:00
Paul Gortmaker
075640e364 net/sched: make sch_blackhole.c explicitly non-modular
The Kconfig currently controlling compilation of this code is:

net/sched/Kconfig:menuconfig NET_SCHED
net/sched/Kconfig:      bool "QoS and/or fair queueing"

...meaning that it currently is not being built as a module by anyone.

Lets remove the modular code that is essentially orphaned, so that
when reading the driver there is no doubt it is builtin-only.

Since module_init translates to device_initcall in the non-modular
case, the init ordering remains unchanged with this commit.  We can
change to one of the other priority initcalls (subsys?) at any later
date, if desired.

We also delete the MODULE_LICENSE tag since all that information
is already contained at the top of the file in the comments.

Cc: Jamal Hadi Salim <jhs@mojatatu.com>
Cc: "David S. Miller" <davem@davemloft.net>
Cc: netdev@vger.kernel.org
Signed-off-by: Paul Gortmaker <paul.gortmaker@windriver.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-10-09 07:52:28 -07:00
WANG Cong
d40496a564 act_mirred: clear sender cpu before sending to tx
Similar to commit c29390c6df ("xps: must clear sender_cpu before forwarding")
the skb->sender_cpu needs to be cleared when moving from Rx
Tx, otherwise kernel could crash.

Fixes: 2bd82484bb ("xps: fix xps for stacked devices")
Cc: Eric Dumazet <edumazet@google.com>
Cc: Jamal Hadi Salim <jhs@mojatatu.com>
Signed-off-by: Cong Wang <cwang@twopensource.com>
Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com>
Acked-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-10-08 04:59:04 -07:00
WANG Cong
215c90afb9 act_mirred: always release tcf hash
Align with other tc actions.

Cc: Jamal Hadi Salim <jhs@mojatatu.com>
Signed-off-by: Cong Wang <cwang@twopensource.com>
Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com>
Acked-by: Jamal Hadi Salim <jhs@mojatatu.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-10-05 06:30:34 -07:00
WANG Cong
6bd00b8506 act_mirred: fix a race condition on mirred_list
After commit 1ce87720d4 ("net: sched: make cls_u32 lockless")
we began to release tc actions in a RCU callback. However,
mirred action relies on RTNL lock to protect the global
mirred_list, therefore we could have a race condition
between RCU callback and netdevice event, which caused
a list corruption as reported by Vinson.

Instead of relying on RTNL lock, introduce a spinlock to
protect this list.

Note, in non-bind case, it is still called with RTNL lock,
therefore should disable BH too.

Reported-by: Vinson Lee <vlee@twopensource.com>
Cc: John Fastabend <john.fastabend@gmail.com>
Cc: Jamal Hadi Salim <jhs@mojatatu.com>
Signed-off-by: Cong Wang <cwang@twopensource.com>
Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com>
Acked-by: Jamal Hadi Salim <jhs@mojatatu.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-10-05 06:30:33 -07:00
Daniel Borkmann
c46646d048 sched, bpf: add helper for retrieving routing realms
Using routing realms as part of the classifier is quite useful, it
can be viewed as a tag for one or multiple routing entries (think of
an analogy to net_cls cgroup for processes), set by user space routing
daemons or via iproute2 as an indicator for traffic classifiers and
later on processed in the eBPF program.

Unlike actions, the classifier can inspect device flags and enable
netif_keep_dst() if necessary. tc actions don't have that possibility,
but in case people know what they are doing, it can be used from there
as well (e.g. via devs that must keep dsts by design anyway).

If a realm is set, the handler returns the non-zero realm. User space
can set the full 32bit realm for the dst.

Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Alexei Starovoitov <ast@plumgrid.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-10-03 05:02:41 -07:00
Eric Dumazet
ca6fb06518 tcp: attach SYNACK messages to request sockets instead of listener
If a listen backlog is very big (to avoid syncookies), then
the listener sk->sk_wmem_alloc is the main source of false
sharing, as we need to touch it twice per SYNACK re-transmit
and TX completion.

(One SYN packet takes listener lock once, but up to 6 SYNACK
are generated)

By attaching the skb to the request socket, we remove this
source of contention.

Tested:

 listen(fd, 10485760); // single listener (no SO_REUSEPORT)
 16 RX/TX queue NIC
 Sustain a SYNFLOOD attack of ~320,000 SYN per second,
 Sending ~1,400,000 SYNACK per second.
 Perf profiles now show listener spinlock being next bottleneck.

    20.29%  [kernel]  [k] queued_spin_lock_slowpath
    10.06%  [kernel]  [k] __inet_lookup_established
     5.12%  [kernel]  [k] reqsk_timer_handler
     3.22%  [kernel]  [k] get_next_timer_interrupt
     3.00%  [kernel]  [k] tcp_make_synack
     2.77%  [kernel]  [k] ipt_do_table
     2.70%  [kernel]  [k] run_timer_softirq
     2.50%  [kernel]  [k] ip_finish_output
     2.04%  [kernel]  [k] cascade

Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-10-03 04:32:43 -07:00
David S. Miller
4963ed48f2 Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net
Conflicts:
	net/ipv4/arp.c

The net/ipv4/arp.c conflict was one commit adding a new
local variable while another commit was deleting one.

Signed-off-by: David S. Miller <davem@davemloft.net>
2015-09-26 16:08:27 -07:00
WANG Cong
d8aecb1011 net: revert "net_sched: move tp->root allocation into fw_init()"
fw filter uses tp->root==NULL to check if it is the old method,
so it doesn't need allocation at all in this case. This patch
reverts the offending commit and adds some comments for old
method to make it obvious.

Fixes: 33f8b9ecdb ("net_sched: move tp->root allocation into fw_init()")
Reported-by: Akshat Kakkar <akshat.1984@gmail.com>
Cc: Jamal Hadi Salim <jhs@mojatatu.com>
Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com>
Acked-by: Jamal Hadi Salim <jhs@mojatatu.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-09-24 14:33:30 -07:00
Daniel Borkmann
5cf8ca0e47 cls_bpf: further limit exec opcodes subset
Jamal suggested to further limit the currently allowed subset of opcodes
that may be used by a direct action return code as the intention is not
to replace the full action engine, but rather to have a minimal set that
can be used in the fast-path on things like ingress for some features
that cls_bpf supports.

Classifiers can, of course, still be chained together that have direct
action mode with those that have a full exec pass. For more complex
scenarios that go beyond this minimal set here, the full tcf_exts_exec()
path must be used.

Suggested-by: Jamal Hadi Salim <jhs@mojatatu.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Alexei Starovoitov <ast@plumgrid.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-09-23 14:29:02 -07:00
Daniel Borkmann
ef146fa40c cls_bpf: make binding to classid optional
The binding to a particular classid was so far always mandatory for
cls_bpf, but it doesn't need to be. Therefore, lift this restriction
as similarly done in other classifiers.

Only a couple of qdiscs make use of class from the tcf_result, others
don't strictly care, so let the user choose his needs (those that read
out class can handle situations where it could be NULL).

An explicit check for tcf_unbind_filter() is also not needed here, as
the previous r->class was 0, so the xchg() will return that and
therefore a callback to the qdisc's unbind_tcf() is skipped.

Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Alexei Starovoitov <ast@plumgrid.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-09-23 14:29:02 -07:00
Daniel Borkmann
bf007d1c75 cls_bpf: also dump TCA_BPF_FLAGS
In commit 43388da42a49 ("cls_bpf: introduce integrated actions") we
have added TCA_BPF_FLAGS. We can also retrieve this information from
the prog, dump it back to user space as well. It's useful in tc when
displaying/dumping filter info.

Also, remove tp from cls_bpf_prog_from_efd(), came in as a conflict
from a rebase and it's unused here (later work may add it along with
a real user).

Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Alexei Starovoitov <ast@plumgrid.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-09-23 14:29:01 -07:00
Eric W. Biederman
a31f1adc09 netfilter: nf_conntrack: Add a struct net parameter to l4_pkt_to_tuple
As gre does not have the srckey in the packet gre_pkt_to_tuple
needs to perform a lookup in it's per network namespace tables.

Pass in the proper network namespace to all pkt_to_tuple
implementations to ensure gre (and any similar protocols) can get this
right.

Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2015-09-18 22:00:04 +02:00
Eric W. Biederman
a4ffe319ae act_connmark: Remember the struct net instead of guessing it.
Stop guessing the struct net instead of remember it.  Guessing is just
silly and will be problematic in the future when I implement routes
between network namespaces.

Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2015-09-18 21:59:31 +02:00
Eric W. Biederman
156c196f60 netfilter: x_tables: Pass struct net in xt_action_param
As xt_action_param lives on the stack this does not bloat any
persistent data structures.

This is a first step in making netfilter code that needs to know
which network namespace it is executing in simpler.

Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2015-09-18 21:58:14 +02:00
Eric Dumazet
47bbbb30b4 sch_dsmark: improve memory locality
Memory placement in sch_dsmark is silly : Better place mask/value
in the same cache line.

Also, we can embed small arrays in the first cache line and
remove a potential cache miss.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-09-17 22:37:19 -07:00
Alexei Starovoitov
27b29f6305 bpf: add bpf_redirect() helper
Existing bpf_clone_redirect() helper clones skb before redirecting
it to RX or TX of destination netdev.
Introduce bpf_redirect() helper that does that without cloning.

Benchmarked with two hosts using 10G ixgbe NICs.
One host is doing line rate pktgen.
Another host is configured as:
$ tc qdisc add dev $dev ingress
$ tc filter add dev $dev root pref 10 u32 match u32 0 0 flowid 1:2 \
   action bpf run object-file tcbpf1_kern.o section clone_redirect_xmit drop
so it receives the packet on $dev and immediately xmits it on $dev + 1
The section 'clone_redirect_xmit' in tcbpf1_kern.o file has the program
that does bpf_clone_redirect() and performance is 2.0 Mpps

$ tc filter add dev $dev root pref 10 u32 match u32 0 0 flowid 1:2 \
   action bpf run object-file tcbpf1_kern.o section redirect_xmit drop
which is using bpf_redirect() - 2.4 Mpps

and using cls_bpf with integrated actions as:
$ tc filter add dev $dev root pref 10 \
  bpf run object-file tcbpf1_kern.o section redirect_xmit integ_act classid 1
performance is 2.5 Mpps

To summarize:
u32+act_bpf using clone_redirect - 2.0 Mpps
u32+act_bpf using redirect - 2.4 Mpps
cls_bpf using redirect - 2.5 Mpps

For comparison linux bridge in this setup is doing 2.1 Mpps
and ixgbe rx + drop in ip_rcv - 7.8 Mpps

Signed-off-by: Alexei Starovoitov <ast@plumgrid.com>
Acked-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: John Fastabend <john.r.fastabend@intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-09-17 21:09:07 -07:00
Daniel Borkmann
045efa82ff cls_bpf: introduce integrated actions
Often cls_bpf classifier is used with single action drop attached.
Optimize this use case and let cls_bpf return both classid and action.
For backwards compatibility reasons enable this feature under
TCA_BPF_FLAG_ACT_DIRECT flag.

Then more interesting programs like the following are easier to write:
int cls_bpf_prog(struct __sk_buff *skb)
{
  /* classify arp, ip, ipv6 into different traffic classes
   * and drop all other packets
   */
  switch (skb->protocol) {
  case htons(ETH_P_ARP):
    skb->tc_classid = 1;
    break;
  case htons(ETH_P_IP):
    skb->tc_classid = 2;
    break;
  case htons(ETH_P_IPV6):
    skb->tc_classid = 3;
    break;
  default:
    return TC_ACT_SHOT;
  }

  return TC_ACT_OK;
}

Joint work with Daniel Borkmann.

Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Signed-off-by: Alexei Starovoitov <ast@plumgrid.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-09-17 21:09:06 -07:00
Tom Herbert
cd79a2382a flow_dissector: Add flags argument to skb_flow_dissector functions
The flags argument will allow control of the dissection process (for
instance whether to parse beyond L3).

Signed-off-by: Tom Herbert <tom@herbertland.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-09-01 15:06:22 -07:00
Daniel Borkmann
c1b3b19923 net: sched: don't break line in tc_classify loop notification
Just some minor noise follow-up to address some stylistic issues of
commit 3b3ae88026 ("net: sched: consolidate tc_classify{,_compat}").
Accidentally v1 instead of v2 of that commit got applied, so this
patch adds the relative diff.

Suggested-by: Alexei Starovoitov <ast@plumgrid.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Alexei Starovoitov <ast@plumgrid.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-08-28 14:01:15 -07:00
David S. Miller
0d36938bb8 Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net 2015-08-27 21:45:31 -07:00
Phil Sutter
3e692f2153 net: sched: simplify attach_one_default_qdisc()
Now that noqueue qdisc can be attached just like any other qdisc, no
special treatment is necessary anymore when attaching it as default
qdisc.

This change has the added benefit that 'tc qdisc show' prints noqueue
instead of nothing for devices defaulting to noqueue.

Signed-off-by: Phil Sutter <phil@nwl.cc>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-08-27 17:14:30 -07:00
Phil Sutter
d66d6c3152 net: sched: register noqueue qdisc
This way users can attach noqueue just like any other qdisc using tc
without having to mess with tx_queue_len first.

Signed-off-by: Phil Sutter <phil@nwl.cc>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-08-27 17:14:30 -07:00
Phil Sutter
db4094bca7 net: sched: ignore tx_queue_len when assigning default qdisc
Since alloc_netdev_mqs() sets IFF_NO_QUEUE for drivers not initializing
tx_queue_len, it is safe to assume that if tx_queue_len is zero,
dev->priv flags always contains IFF_NO_QUEUE.

Signed-off-by: Phil Sutter <phil@nwl.cc>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-08-27 17:14:29 -07:00
Daniel Borkmann
3b3ae88026 net: sched: consolidate tc_classify{,_compat}
For classifiers getting invoked via tc_classify(), we always need an
extra function call into tc_classify_compat(), as both are being
exported as symbols and tc_classify() itself doesn't do much except
handling of reclassifications when tp->classify() returned with
TC_ACT_RECLASSIFY.

CBQ and ATM are the only qdiscs that directly call into tc_classify_compat(),
all others use tc_classify(). When tc actions are being configured
out in the kernel, tc_classify() effectively does nothing besides
delegating.

We could spare this layer and consolidate both functions. pktgen on
single CPU constantly pushing skbs directly into the netif_receive_skb()
path with a dummy classifier on ingress qdisc attached, improves
slightly from 22.3Mpps to 23.1Mpps.

Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Alexei Starovoitov <ast@plumgrid.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-08-27 14:18:48 -07:00
Alexei Starovoitov
cff82457c5 net_sched: act_bpf: remove spinlock in fast path
Similar to act_gact/act_mirred, act_bpf can be lockless in packet processing
with extra care taken to free bpf programs after rcu grace period.
Replacement of existing act_bpf (very rare) is done with synchronize_rcu()
and final destruction is done from tc_action_ops->cleanup() callback that is
called from tcf_exts_destroy()->tcf_action_destroy()->__tcf_hash_release() when
bind and refcnt reach zero which is only possible when classifier is destroyed.
Previous two patches fixed the last two classifiers (tcindex and rsvp) to
call tcf_exts_destroy() from rcu callback.

Similar to gact/mirred there is a race between prog->filter and
prog->tcf_action. Meaning that the program being replaced may use
previous default action if it happened to return TC_ACT_UNSPEC.
act_mirred race betwen tcf_action and tcfm_dev is similar.
In all cases the race is harmless.
Long term we may want to improve the situation by replacing the whole
tc_action->priv as single pointer instead of updating inner fields one by one.

Signed-off-by: Alexei Starovoitov <ast@plumgrid.com>
Acked-by: Daniel Borkmann <daniel@iogearbox.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-08-26 11:01:45 -07:00
Alexei Starovoitov
9e528d8915 net_sched: convert rsvp to call tcf_exts_destroy from rcu callback
Adjust destroy path of cls_rsvp to call tcf_exts_destroy() after
rcu grace period.

Signed-off-by: Alexei Starovoitov <ast@plumgrid.com>
Acked-by: Daniel Borkmann <daniel@iogearbox.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-08-26 11:01:45 -07:00
Alexei Starovoitov
ed7aa879ce net_sched: convert tcindex to call tcf_exts_destroy from rcu callback
Adjust destroy path of cls_tcindex to call tcf_exts_destroy() after
rcu grace period.

Signed-off-by: Alexei Starovoitov <ast@plumgrid.com>
Acked-by: Daniel Borkmann <daniel@iogearbox.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-08-26 11:01:44 -07:00
Alexei Starovoitov
faa54be4c7 net_sched: act_bpf: remove unnecessary copy
Fix harmless typo and avoid unnecessary copy of empty 'prog' into
unused 'strcut tcf_bpf_cfg old'.

Fixes: f4eaed28c7 ("act_bpf: fix memory leaks when replacing bpf programs")
Signed-off-by: Alexei Starovoitov <ast@plumgrid.com>
Acked-by: Daniel Borkmann <daniel@iogearbox.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-08-26 11:01:44 -07:00
Alexei Starovoitov
3c645621b7 net_sched: make tcf_hash_destroy() static
tcf_hash_destroy() used once. Make it static.

Signed-off-by: Alexei Starovoitov <ast@plumgrid.com>
Acked-by: Daniel Borkmann <daniel@iogearbox.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-08-26 11:01:44 -07:00
WANG Cong
a6c1aea044 cls_u32: complete the check for non-forced case in u32_destroy()
In commit 1e052be69d ("net_sched: destroy proto tp when all filters are gone")
I added a check in u32_destroy() to see if all real filters are gone
for each tp, however, that is only done for root_ht, same is needed
for others.

This can be reproduced by the following tc commands:

tc filter add dev eth0 parent 1:0 prio 5 handle 15: protocol ip u32 divisor 256
tc filter add dev eth0 protocol ip parent 1: prio 5 handle 15:2:2 u32
ht 15:2: match ip src 10.0.0.2 flowid 1:10
tc filter add dev eth0 protocol ip parent 1: prio 5 handle 15:2:3 u32
ht 15:2: match ip src 10.0.0.3 flowid 1:10

Fixes: 1e052be69d ("net_sched: destroy proto tp when all filters are gone")
Reported-by: Akshat Kakkar <akshat.1984@gmail.com>
Cc: Jamal Hadi Salim <jhs@mojatatu.com>
Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com>
Signed-off-by: Cong Wang <cwang@twopensource.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-08-25 17:02:48 -07:00
Pablo Neira Ayuso
81bf1c64e7 Merge branch 'master' of git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-next
Resolve conflicts with conntrack template fixes.

Conflicts:
	net/netfilter/nf_conntrack_core.c
	net/netfilter/nf_synproxy_core.c
	net/netfilter/xt_CT.c

Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2015-08-21 06:09:05 +02:00
Phil Sutter
348e3435cb net: sched: drop all special handling of tx_queue_len == 0
Those were all workarounds for the formerly double meaning of
tx_queue_len, which broke scheduling algorithms if untreated.

Now that all in-tree drivers have been converted away from setting
tx_queue_len = 0, it should be safe to drop these workarounds for
categorically broken setups.

Signed-off-by: Phil Sutter <phil@nwl.cc>
Cc: Jamal Hadi Salim <jhs@mojatatu.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-08-18 11:55:08 -07:00
Tom Herbert
4b048d6d9d net: Change pseudohdr argument of inet_proto_csum_replace* to be a bool
inet_proto_csum_replace4,2,16 take a pseudohdr argument which indicates
the checksum field carries a pseudo header. This argument should be a
boolean instead of an int.

Signed-off-by: Tom Herbert <tom@herbertland.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-08-17 21:33:06 -07:00
Daniel Borkmann
deedb59039 netfilter: nf_conntrack: add direction support for zones
This work adds a direction parameter to netfilter zones, so identity
separation can be performed only in original/reply or both directions
(default). This basically opens up the possibility of doing NAT with
conflicting IP address/port tuples from multiple, isolated tenants
on a host (e.g. from a netns) without requiring each tenant to NAT
twice resp. to use its own dedicated IP address to SNAT to, meaning
overlapping tuples can be made unique with the zone identifier in
original direction, where the NAT engine will then allocate a unique
tuple in the commonly shared default zone for the reply direction.
In some restricted, local DNAT cases, also port redirection could be
used for making the reply traffic unique w/o requiring SNAT.

The consensus we've reached and discussed at NFWS and since the initial
implementation [1] was to directly integrate the direction meta data
into the existing zones infrastructure, as opposed to the ct->mark
approach we proposed initially.

As we pass the nf_conntrack_zone object directly around, we don't have
to touch all call-sites, but only those, that contain equality checks
of zones. Thus, based on the current direction (original or reply),
we either return the actual id, or the default NF_CT_DEFAULT_ZONE_ID.
CT expectations are direction-agnostic entities when expectations are
being compared among themselves, so we can only use the identifier
in this case.

Note that zone identifiers can not be included into the hash mix
anymore as they don't contain a "stable" value that would be equal
for both directions at all times, f.e. if only zone->id would
unconditionally be xor'ed into the table slot hash, then replies won't
find the corresponding conntracking entry anymore.

If no particular direction is specified when configuring zones, the
behaviour is exactly as we expect currently (both directions).

Support has been added for the CT netlink interface as well as the
x_tables raw CT target, which both already offer existing interfaces
to user space for the configuration of zones.

Below a minimal, simplified collision example (script in [2]) with
netperf sessions:

  +--- tenant-1 ---+   mark := 1
  |    netperf     |--+
  +----------------+  |                CT zone := mark [ORIGINAL]
   [ip,sport] := X   +--------------+  +--- gateway ---+
                     | mark routing |--|     SNAT      |-- ... +
                     +--------------+  +---------------+       |
  +--- tenant-2 ---+  |                                     ~~~|~~~
  |    netperf     |--+                +-----------+           |
  +----------------+   mark := 2       | netserver |------ ... +
   [ip,sport] := X                     +-----------+
                                        [ip,port] := Y
On the gateway netns, example:

  iptables -t raw -A PREROUTING -j CT --zone mark --zone-dir ORIGINAL
  iptables -t nat -A POSTROUTING -o <dev> -j SNAT --to-source <ip> --random-fully

  iptables -t mangle -A PREROUTING -m conntrack --ctdir ORIGINAL -j CONNMARK --save-mark
  iptables -t mangle -A POSTROUTING -m conntrack --ctdir REPLY -j CONNMARK --restore-mark

conntrack dump from gateway netns:

  netperf -H 10.1.1.2 -t TCP_STREAM -l60 -p12865,5555 from each tenant netns

  tcp 6 431995 ESTABLISHED src=40.1.1.1 dst=10.1.1.2 sport=5555 dport=12865 zone-orig=1
                           src=10.1.1.2 dst=10.1.1.1 sport=12865 dport=1024
               [ASSURED] mark=1 secctx=system_u:object_r:unlabeled_t:s0 use=1

  tcp 6 431994 ESTABLISHED src=40.1.1.1 dst=10.1.1.2 sport=5555 dport=12865 zone-orig=2
                           src=10.1.1.2 dst=10.1.1.1 sport=12865 dport=5555
               [ASSURED] mark=2 secctx=system_u:object_r:unlabeled_t:s0 use=1

  tcp 6 299 ESTABLISHED src=40.1.1.1 dst=10.1.1.2 sport=39438 dport=33768 zone-orig=1
                        src=10.1.1.2 dst=10.1.1.1 sport=33768 dport=39438
               [ASSURED] mark=1 secctx=system_u:object_r:unlabeled_t:s0 use=1

  tcp 6 300 ESTABLISHED src=40.1.1.1 dst=10.1.1.2 sport=32889 dport=40206 zone-orig=2
                        src=10.1.1.2 dst=10.1.1.1 sport=40206 dport=32889
               [ASSURED] mark=2 secctx=system_u:object_r:unlabeled_t:s0 use=2

Taking this further, test script in [2] creates 200 tenants and runs
original-tuple colliding netperf sessions each. A conntrack -L dump in
the gateway netns also confirms 200 overlapping entries, all in ESTABLISHED
state as expected.

I also did run various other tests with some permutations of the script,
to mention some: SNAT in random/random-fully/persistent mode, no zones (no
overlaps), static zones (original, reply, both directions), etc.

  [1] http://thread.gmane.org/gmane.comp.security.firewalls.netfilter.devel/57412/
  [2] https://paste.fedoraproject.org/242835/65657871/

Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2015-08-18 01:22:50 +02:00