Pablo Neira Ayuso says:
====================
Netfilter/nftables/IPVS fixes for net
The following patchset contains Netfilter/IPVS fixes, mostly nftables
fixes, most relevantly they are:
* Fix a crash in the h323 conntrack NAT helper due to expectation list
corruption, from Alexey Dobriyan.
* A couple of RCU race fixes for conntrack, one manifests by hitting BUG_ON
in nf_nat_setup_info() and the destroy path, patches from Andrey Vagin and
me.
* Dump direction attribute in nft_ct only if it is set, from Arturo
Borrero.
* Fix IPVS bug in its own connection tracking system that may lead to
copying only 4 bytes of the IPv6 address when initializing the
ip_vs_conn object, from Michal Kubecek.
* Fix -EBUSY errors in nftables when deleting the rules, chain and tables
in a row due mixture of asynchronous and synchronous object releasing,
from me.
* Three fixes for the nf_tables set infrastructure when using intervals and
mappings, from me.
* Four patches to fixing the nf_tables log, reject and ct expressions from
the new inet table, from Patrick McHardy.
* Fix memory overrun in the map that is used to dynamically allocate names
from anonymous sets, also from Patrick.
* Fix a potential oops if you dump a set with NFPROTO_UNSPEC and a table
name, from Patrick McHardy.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
It makes no sense to inline a rarely used function meant for debugging
only that is called a total of five times in the main evaluation loop.
Signed-off-by: Patrick McHardy <kaber@trash.net>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Fix access to uninitialized data for end interval elements. The
element data part is uninitialized in interval end elements.
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
This patch fixes several things which related to the handling of
end interval elements:
* Chain use underflow with intervals and map: If you add a rule
using intervals+map that introduces a loop, the error path of the
rbtree set decrements the chain refcount for each side of the
interval, leading to a chain use counter underflow.
* Don't copy the data part of the end interval element since, this
area is uninitialized and this confuses the loop detection code.
* Don't allocate room for the data part of end interval elements
since this is unused.
So, after this patch the idea is that end interval elements don't
have a data part.
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Acked-by: Patrick McHardy <kaber@trash.net>
This combination is not allowed since end interval elements cannot
contain data.
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Acked-by: Patrick McHardy <kaber@trash.net>
TCP pacing depends on an accurate srtt estimation.
Current srtt estimation is using jiffie resolution,
and has an artificial offset of at least 1 ms, which can produce
slowdowns when FQ/pacing is used, especially in DC world,
where typical rtt is below 1 ms.
We are planning a switch to usec resolution for linux-3.15,
but in the meantime, this patch removes the 1 ms offset.
All we need is to have tp->srtt minimal value of 1 to differentiate
the case of srtt being initialized or not, not 8.
The problematic behavior was observed on a 40Gbit testbed,
where 32 concurrent netperf were reaching 12Gbps of aggregate
speed, instead of line speed.
This patch also has the effect of reporting more accurate srtt and send
rates to iproute2 ss command as in :
$ ss -i dst cca2
Netid State Recv-Q Send-Q Local Address:Port
Peer Address:Port
tcp ESTAB 0 0 10.244.129.1:56984
10.244.129.2:12865
cubic wscale:6,6 rto:200 rtt:0.25/0.25 ato:40 mss:1448 cwnd:10 send
463.4Mbps rcv_rtt:1 rcv_space:29200
tcp ESTAB 0 390960 10.244.129.1:60247
10.244.129.2:50204
cubic wscale:6,6 rto:200 rtt:0.875/0.75 mss:1448 cwnd:73 ssthresh:51
send 966.4Mbps unacked:73 retrans:0/121 rcv_space:29200
Reported-by: Vytautas Valancius <valas@google.com>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Yuchung Cheng <ycheng@google.com>
Cc: Neal Cardwell <ncardwell@google.com>
Acked-by: Neal Cardwell <ncardwell@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Commit 93d8bf9fb8 ("bridge: cleanup netpoll code") introduced
a check in br_netpoll_enable(), but this check is incorrect for
br_netpoll_setup(). This patch moves the code after the check
into __br_netpoll_enable() and calls it in br_netpoll_setup().
For br_add_if(), the check is still needed.
Fixes: 93d8bf9fb8 ("bridge: cleanup netpoll code")
Cc: Toshiaki Makita <makita.toshiaki@lab.ntt.co.jp>
Cc: Stephen Hemminger <stephen@networkplumber.org>
Cc: David S. Miller <davem@davemloft.net>
Signed-off-by: Cong Wang <cwang@twopensource.com>
Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com>
Acked-by: Toshiaki Makita <makita.toshiaki@lab.ntt.co.jp>
Tested-by: Toshiaki Makita <makita.toshiaki@lab.ntt.co.jp>
Signed-off-by: David S. Miller <davem@davemloft.net>
sock_alloc_send_pskb() & sk_page_frag_refill()
have a loop trying high order allocations to prepare
skb with low number of fragments as this increases performance.
Problem is that under memory pressure/fragmentation, this can
trigger OOM while the intent was only to try the high order
allocations, then fallback to order-0 allocations.
We had various reports from unexpected regressions.
According to David, setting __GFP_NORETRY should be fine,
as the asynchronous compaction is still enabled, and this
will prevent OOM from kicking as in :
CFSClientEventm invoked oom-killer: gfp_mask=0x42d0, order=3, oom_adj=0,
oom_score_adj=0, oom_score_badness=2 (enabled),memcg_scoring=disabled
CFSClientEventm
Call Trace:
[<ffffffff8043766c>] dump_header+0xe1/0x23e
[<ffffffff80437a02>] oom_kill_process+0x6a/0x323
[<ffffffff80438443>] out_of_memory+0x4b3/0x50d
[<ffffffff8043a4a6>] __alloc_pages_may_oom+0xa2/0xc7
[<ffffffff80236f42>] __alloc_pages_nodemask+0x1002/0x17f0
[<ffffffff8024bd23>] alloc_pages_current+0x103/0x2b0
[<ffffffff8028567f>] sk_page_frag_refill+0x8f/0x160
[<ffffffff80295fa0>] tcp_sendmsg+0x560/0xee0
[<ffffffff802a5037>] inet_sendmsg+0x67/0x100
[<ffffffff80283c9c>] __sock_sendmsg_nosec+0x6c/0x90
[<ffffffff80283e85>] sock_sendmsg+0xc5/0xf0
[<ffffffff802847b6>] __sys_sendmsg+0x136/0x430
[<ffffffff80284ec8>] sys_sendmsg+0x88/0x110
[<ffffffff80711472>] system_call_fastpath+0x16/0x1b
Out of Memory: Kill process 2856 (bash) score 9999 or sacrifice child
Signed-off-by: Eric Dumazet <edumazet@google.com>
Acked-by: David Rientjes <rientjes@google.com>
Acked-by: "Eric W. Biederman" <ebiederm@xmission.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Currently, to make netconsole start over IPv6, the source address
needs to be specified. Without a source address, netpoll_parse_options
assumes we're setting up over IPv4 and the destination IPv6 address is
rejected.
Check if the IP version has been forced by a source address before
checking for a version mismatch when parsing the destination address.
Signed-off-by: Sabrina Dubroca <sd@queasysnail.net>
Acked-by: Cong Wang <cwang@twopensource.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Commit ee45fd92c7 ("sfc: Use TX PIO
for sufficiently small packets") introduced the following warning:
drivers/net/ethernet/sfc/tx.c: In function 'efx_enqueue_skb':
drivers/net/ethernet/sfc/tx.c:432:1: warning: label 'finish_packet' defined but not used
Stick the label inside the same #ifdef that the code which calls
it uses. Note that this is only seen for arch that do not set
ARCH_HAS_IOREMAP_WC, such as arm, mips, sparc, ..., as the others
enable the write combining code and hence use the label.
Cc: Jon Cooper <jcooper@solarflare.com>
Cc: Ben Hutchings <bhutchings@solarflare.com>
Signed-off-by: Paul Gortmaker <paul.gortmaker@windriver.com>
Acked-by: Shradha Shah <sshah@solarflare.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
It seems like this function was intended to have special handling for
urb statuses of -ENOENT and -ECONNRESET. But now it just prints some
debugging and returns at the start of the function.
I have removed the dead code, it's still in the git history if anyone
wants to revive it.
Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Commit cfd280c912 ("net: sync some IP headers with glibc") changed a set of
define's to an enum (with no explanation why) which introduced a bug
in module mip6 where aliases are generated using the IPPROTO_* defines;
mip6 doesn't load if require_module called with the aliases from
xfrm_get_type().
Reverting this change back to define's to fix the aliases.
modinfo mip6 (before this change)
alias: xfrm-type-10-IPPROTO_DSTOPTS
alias: xfrm-type-10-IPPROTO_ROUTING
modinfo mip6 (after this change)
alias: xfrm-type-10-43
alias: xfrm-type-10-60
Signed-off-by: Jan Moskyto Matejka <mq@suse.cz>
Signed-off-by: David S. Miller <davem@davemloft.net>
This is a static checker fix, but judging from the context then I think
hexidecimal 0x80 is intended here instead of decimal 80.
Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
commit efe4208f47:
'ipv6: make lookups simpler and faster' broke initialization of local source
address on accepted ipv6 sockets. Before the mentioned commit receive address
was copied along with the contents of ipv6_pinfo in sctp_v6_create_accept_sk.
Now when it is moved, it has to be copied separately.
This also fixes lksctp's ipv6 regression in a sense that test_getname_v6, TC5 -
'getsockname on a connected server socket' now passes.
Signed-off-by: Matija Glavinic Pecotic <matija.glavinic-pecotic.ext@nsn.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
The submission of the interrupt transfer should be done after setting
the bit of WORK_ENABLE, otherwise the callback function would have
the opportunity to be returned directly.
Clear the bit of WORK_ENABLE before killing the interrupt transfer.
Signed-off-by: Hayes Wang <hayeswang@realtek.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
bnx2x driver uses incorrect PF identifier to configure (in HW) the VF
interrupt scheme; As a result, in multi-function mode the configuration
for PFs with a high index (4+) will overflow and the PF will erroneously
configure a single ISR scheme for its VFs.
As a result, if such a VF uses multiple queues, interrupt generation will
stop after VF receives an Rx packet or sends a Tx packet on a queue
other than queue[0].
Signed-off-by: Yuval Mintz <yuvalmin@broadcom.com>
Signed-off-by: Ariel Elior <ariele@broadcom.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Quoting David Vrabel -
"5780 cards cannot have jumbo frames and TSO enabled together. When
jumbo frames are enabled by setting the MTU, the TSO feature must be
cleared. This is done indirectly by calling netdev_update_features()
which will call tg3_fix_features() to actually clear the flags.
netdev_update_features() will also trigger a new netlink message for the
feature change event which will result in a call to tg3_get_stats64()
which deadlocks on the tg3 lock."
tg3_set_mtu() does not need to be under the tg3 lock since converting
the flags to use set_bit(). Move it out to after tg3_netif_stop().
Reported-by: David Vrabel <david.vrabel@citrix.com>
Tested-by: David Vrabel <david.vrabel@citrix.com>
Signed-off-by: Michael Chan <mchan@broadcom.com>
Signed-off-by: Nithin Nayak Sujir <nsujir@broadcom.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
In the original code, if tg3_readphy() fails then it does an unnecessary
check to verify "err" is still zero and then returns -EBUSY.
My static checker complains about the unnecessary "if (!err)" check and
anyway it is better to propagate the -EBUSY error code from
tg3_readphy() instead of hard coding it here. And really the original
code is confusing to look at.
Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com>
Acked-by: Nithin Nayak Sujir <nsujir@broadcom.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Occasionally users want to know what parameters their Broadcom drivers
are running with. For example, a user may want to know if MSI is
disabled.
This patch has been compile tested.
Signed-off-by: James M Leddy <james.leddy@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
This patch removes old and unsupported CLPS711X IrDA driver.
Support for IrDA for CLPS711X serial port now provided by commit
4a33f1f59abd (serial: clps711x: Add support for N_IRDA line
discipline), so IrDA-mode can be turned ON with "irattach" tool
through "irtty" driver.
Signed-off-by: Alexander Shiyan <shc_work@mail.ru>
Signed-off-by: David S. Miller <davem@davemloft.net>
Switch the device tree to the new compatibles introduced in the ethernet and
mdio drivers to have a common pattern accross all Allwinner SoCs.
Signed-off-by: Maxime Ripard <maxime.ripard@free-electrons.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
The Allwinner A10 compatibles were following a slightly different compatible
patterns than the rest of the SoCs for historical reasons. Add compatibles
matching the other pattern to the mdio driver for consistency, and keep the
older one for backward compatibility.
Signed-off-by: Maxime Ripard <maxime.ripard@free-electrons.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
The Allwinner A10 compatibles were following a slightly different compatible
patterns than the rest of the SoCs for historical reasons. Add compatibles
matching the other pattern to the ethernet driver for consistency, and keep the
older one for backward compatibility.
Signed-off-by: Maxime Ripard <maxime.ripard@free-electrons.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
We may lost race if we flush the rule-set (which happens asynchronously
via call_rcu) and we try to remove the table (that userspace assumes
to be empty).
Fix this by recovering synchronous rule and chain deletion. This was
introduced time ago before we had no batch support, and synchronous
rule deletion performance was not good. Now that we have the batch
support, we can just postpone the purge of old rule in a second step
in the commit phase. All object deletions are synchronous after this
patch.
As a side effect, we save memory as we don't need rcu_head per rule
anymore.
Cc: Patrick McHardy <kaber@trash.net>
Reported-by: Arturo Borrero Gonzalez <arturo.borrero.glez@gmail.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
The log and queue expressions both store the family during ->init() and
use it to deliver packets. This is wrong when used in NFPROTO_INET since
they should both deliver to the actual AF of the packet, not the dummy
NFPROTO_INET.
Use the family from the hook ops to fix this.
Signed-off-by: Patrick McHardy <kaber@trash.net>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Add a reject module for NFPROTO_INET. It does nothing but dispatch
to the AF-specific modules based on the hook family.
Signed-off-by: Patrick McHardy <kaber@trash.net>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Currently the nft_reject module depends on symbols from ipv6. This is
wrong since no generic module should force IPv6 support to be loaded.
Split up the module into AF-specific and a generic part.
Signed-off-by: Patrick McHardy <kaber@trash.net>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Jesse Gross says:
====================
Open vSwitch
A handful of bug fixes for net/3.14. High level fixes are:
* Regressions introduced by the zerocopy changes, particularly with
old userspaces.
* A few bugs lingering from the introduction of megaflows.
* Overly zealous error checking that is now being triggered frequently
in common cases.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
The recent patch to fix receive side flow control
(11b57f9025: xen-netback: stop vif thread
spinning if frontend is unresponsive) solved the spinning thread problem,
however caused an another one. The receive side can stall, if:
- [THREAD] xenvif_rx_action sets rx_queue_stopped to true
- [INTERRUPT] interrupt happens, and sets rx_event to true
- [THREAD] then xenvif_kthread sets rx_event to false
- [THREAD] rx_work_todo doesn't return true anymore
Also, if interrupt sent but there is still no room in the ring, it take quite a
long time until xenvif_rx_action realize it. This patch ditch that two variable,
and rework rx_work_todo. If the thread finds it can't fit more skb's into the
ring, it saves the last slot estimation into rx_last_skb_slots, otherwise it's
kept as 0. Then rx_work_todo will check if:
- there is something to send to the ring (like before)
- there is space for the topmost packet in the queue
I think that's more natural and optimal thing to test than two bool which are
set somewhere else.
Signed-off-by: Zoltan Kiss <zoltan.kiss@citrix.com>
Reviewed-by: Paul Durrant <paul.durrant@citrix.com>
Acked-by: Wei Liu <wei.liu2@citrix.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
For the reject module, we need to add AF-specific implementations to
get rid of incorrect module dependencies. Try to load an AF-specific
module first and fall back to generic modules.
Signed-off-by: Patrick McHardy <kaber@trash.net>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
The key was missing in the list of valid keys, add it.
Signed-off-by: Patrick McHardy <kaber@trash.net>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Commit c9c8e48597 (netfilter: nf_tables: dump sets in all existing families)
changed nft_ctx_init_from_setattr() to only look up the address family if it
is not NFPROTO_UNSPEC. However if it is NFPROTO_UNSPEC and a table attribute
is given, nftables_afinfo_lookup() will dereference the NULL afi pointer.
Fix by checking for non-NULL afi and also move a check added by that commit
to the proper position.
Signed-off-by: Patrick McHardy <kaber@trash.net>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
The map that is used to allocate anonymous sets is indeed
BITS_PER_BYTE * PAGE_SIZE long.
Signed-off-by: Patrick McHardy <kaber@trash.net>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
With this patch, the conntrack refcount is initially set to zero and
it is bumped once it is added to any of the list, so we fulfill
Eric's golden rule which is that all released objects always have a
refcount that equals zero.
Andrey Vagin reports that nf_conntrack_free can't be called for a
conntrack with non-zero ref-counter, because it can race with
nf_conntrack_find_get().
A conntrack slab is created with SLAB_DESTROY_BY_RCU. Non-zero
ref-counter says that this conntrack is used. So when we release
a conntrack with non-zero counter, we break this assumption.
CPU1 CPU2
____nf_conntrack_find()
nf_ct_put()
destroy_conntrack()
...
init_conntrack
__nf_conntrack_alloc (set use = 1)
atomic_inc_not_zero(&ct->use) (use = 2)
if (!l4proto->new(ct, skb, dataoff, timeouts))
nf_conntrack_free(ct); (use = 2 !!!)
...
__nf_conntrack_alloc (set use = 1)
if (!nf_ct_key_equal(h, tuple, zone))
nf_ct_put(ct); (use = 0)
destroy_conntrack()
/* continue to work with CT */
After applying the path "[PATCH] netfilter: nf_conntrack: fix RCU
race in nf_conntrack_find_get" another bug was triggered in
destroy_conntrack():
<4>[67096.759334] ------------[ cut here ]------------
<2>[67096.759353] kernel BUG at net/netfilter/nf_conntrack_core.c:211!
...
<4>[67096.759837] Pid: 498649, comm: atdd veid: 666 Tainted: G C --------------- 2.6.32-042stab084.18 #1 042stab084_18 /DQ45CB
<4>[67096.759932] RIP: 0010:[<ffffffffa03d99ac>] [<ffffffffa03d99ac>] destroy_conntrack+0x15c/0x190 [nf_conntrack]
<4>[67096.760255] Call Trace:
<4>[67096.760255] [<ffffffff814844a7>] nf_conntrack_destroy+0x17/0x30
<4>[67096.760255] [<ffffffffa03d9bb5>] nf_conntrack_find_get+0x85/0x130 [nf_conntrack]
<4>[67096.760255] [<ffffffffa03d9fb2>] nf_conntrack_in+0x352/0xb60 [nf_conntrack]
<4>[67096.760255] [<ffffffffa048c771>] ipv4_conntrack_local+0x51/0x60 [nf_conntrack_ipv4]
<4>[67096.760255] [<ffffffff81484419>] nf_iterate+0x69/0xb0
<4>[67096.760255] [<ffffffff814b5b00>] ? dst_output+0x0/0x20
<4>[67096.760255] [<ffffffff814845d4>] nf_hook_slow+0x74/0x110
<4>[67096.760255] [<ffffffff814b5b00>] ? dst_output+0x0/0x20
<4>[67096.760255] [<ffffffff814b66d5>] raw_sendmsg+0x775/0x910
<4>[67096.760255] [<ffffffff8104c5a8>] ? flush_tlb_others_ipi+0x128/0x130
<4>[67096.760255] [<ffffffff8100bc4e>] ? apic_timer_interrupt+0xe/0x20
<4>[67096.760255] [<ffffffff8100bc4e>] ? apic_timer_interrupt+0xe/0x20
<4>[67096.760255] [<ffffffff814c136a>] inet_sendmsg+0x4a/0xb0
<4>[67096.760255] [<ffffffff81444e93>] ? sock_sendmsg+0x13/0x140
<4>[67096.760255] [<ffffffff81444f97>] sock_sendmsg+0x117/0x140
<4>[67096.760255] [<ffffffff8102e299>] ? native_smp_send_reschedule+0x49/0x60
<4>[67096.760255] [<ffffffff81519beb>] ? _spin_unlock_bh+0x1b/0x20
<4>[67096.760255] [<ffffffff8109d930>] ? autoremove_wake_function+0x0/0x40
<4>[67096.760255] [<ffffffff814960f0>] ? do_ip_setsockopt+0x90/0xd80
<4>[67096.760255] [<ffffffff8100bc4e>] ? apic_timer_interrupt+0xe/0x20
<4>[67096.760255] [<ffffffff8100bc4e>] ? apic_timer_interrupt+0xe/0x20
<4>[67096.760255] [<ffffffff814457c9>] sys_sendto+0x139/0x190
<4>[67096.760255] [<ffffffff810efa77>] ? audit_syscall_entry+0x1d7/0x200
<4>[67096.760255] [<ffffffff810ef7c5>] ? __audit_syscall_exit+0x265/0x290
<4>[67096.760255] [<ffffffff81474daf>] compat_sys_socketcall+0x13f/0x210
<4>[67096.760255] [<ffffffff8104dea3>] ia32_sysret+0x0/0x5
I have reused the original title for the RFC patch that Andrey posted and
most of the original patch description.
Cc: Eric Dumazet <edumazet@google.com>
Cc: Andrew Vagin <avagin@parallels.com>
Cc: Florian Westphal <fw@strlen.de>
Reported-by: Andrew Vagin <avagin@parallels.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Acked-by: Andrew Vagin <avagin@parallels.com>
Lets look at destroy_conntrack:
hlist_nulls_del_rcu(&ct->tuplehash[IP_CT_DIR_ORIGINAL].hnnode);
...
nf_conntrack_free(ct)
kmem_cache_free(net->ct.nf_conntrack_cachep, ct);
net->ct.nf_conntrack_cachep is created with SLAB_DESTROY_BY_RCU.
The hash is protected by rcu, so readers look up conntracks without
locks.
A conntrack is removed from the hash, but in this moment a few readers
still can use the conntrack. Then this conntrack is released and another
thread creates conntrack with the same address and the equal tuple.
After this a reader starts to validate the conntrack:
* It's not dying, because a new conntrack was created
* nf_ct_tuple_equal() returns true.
But this conntrack is not initialized yet, so it can not be used by two
threads concurrently. In this case BUG_ON may be triggered from
nf_nat_setup_info().
Florian Westphal suggested to check the confirm bit too. I think it's
right.
task 1 task 2 task 3
nf_conntrack_find_get
____nf_conntrack_find
destroy_conntrack
hlist_nulls_del_rcu
nf_conntrack_free
kmem_cache_free
__nf_conntrack_alloc
kmem_cache_alloc
memset(&ct->tuplehash[IP_CT_DIR_MAX],
if (nf_ct_is_dying(ct))
if (!nf_ct_tuple_equal()
I'm not sure, that I have ever seen this race condition in a real life.
Currently we are investigating a bug, which is reproduced on a few nodes.
In our case one conntrack is initialized from a few tasks concurrently,
we don't have any other explanation for this.
<2>[46267.083061] kernel BUG at net/ipv4/netfilter/nf_nat_core.c:322!
...
<4>[46267.083951] RIP: 0010:[<ffffffffa01e00a4>] [<ffffffffa01e00a4>] nf_nat_setup_info+0x564/0x590 [nf_nat]
...
<4>[46267.085549] Call Trace:
<4>[46267.085622] [<ffffffffa023421b>] alloc_null_binding+0x5b/0xa0 [iptable_nat]
<4>[46267.085697] [<ffffffffa02342bc>] nf_nat_rule_find+0x5c/0x80 [iptable_nat]
<4>[46267.085770] [<ffffffffa0234521>] nf_nat_fn+0x111/0x260 [iptable_nat]
<4>[46267.085843] [<ffffffffa0234798>] nf_nat_out+0x48/0xd0 [iptable_nat]
<4>[46267.085919] [<ffffffff814841b9>] nf_iterate+0x69/0xb0
<4>[46267.085991] [<ffffffff81494e70>] ? ip_finish_output+0x0/0x2f0
<4>[46267.086063] [<ffffffff81484374>] nf_hook_slow+0x74/0x110
<4>[46267.086133] [<ffffffff81494e70>] ? ip_finish_output+0x0/0x2f0
<4>[46267.086207] [<ffffffff814b5890>] ? dst_output+0x0/0x20
<4>[46267.086277] [<ffffffff81495204>] ip_output+0xa4/0xc0
<4>[46267.086346] [<ffffffff814b65a4>] raw_sendmsg+0x8b4/0x910
<4>[46267.086419] [<ffffffff814c10fa>] inet_sendmsg+0x4a/0xb0
<4>[46267.086491] [<ffffffff814459aa>] ? sock_update_classid+0x3a/0x50
<4>[46267.086562] [<ffffffff81444d67>] sock_sendmsg+0x117/0x140
<4>[46267.086638] [<ffffffff8151997b>] ? _spin_unlock_bh+0x1b/0x20
<4>[46267.086712] [<ffffffff8109d370>] ? autoremove_wake_function+0x0/0x40
<4>[46267.086785] [<ffffffff81495e80>] ? do_ip_setsockopt+0x90/0xd80
<4>[46267.086858] [<ffffffff8100be0e>] ? call_function_interrupt+0xe/0x20
<4>[46267.086936] [<ffffffff8118cb10>] ? ub_slab_ptr+0x20/0x90
<4>[46267.087006] [<ffffffff8118cb10>] ? ub_slab_ptr+0x20/0x90
<4>[46267.087081] [<ffffffff8118f2e8>] ? kmem_cache_alloc+0xd8/0x1e0
<4>[46267.087151] [<ffffffff81445599>] sys_sendto+0x139/0x190
<4>[46267.087229] [<ffffffff81448c0d>] ? sock_setsockopt+0x16d/0x6f0
<4>[46267.087303] [<ffffffff810efa47>] ? audit_syscall_entry+0x1d7/0x200
<4>[46267.087378] [<ffffffff810ef795>] ? __audit_syscall_exit+0x265/0x290
<4>[46267.087454] [<ffffffff81474885>] ? compat_sys_setsockopt+0x75/0x210
<4>[46267.087531] [<ffffffff81474b5f>] compat_sys_socketcall+0x13f/0x210
<4>[46267.087607] [<ffffffff8104dea3>] ia32_sysret+0x0/0x5
<4>[46267.087676] Code: 91 20 e2 01 75 29 48 89 de 4c 89 f7 e8 56 fa ff ff 85 c0 0f 84 68 fc ff ff 0f b6 4d c6 41 8b 45 00 e9 4d fb ff ff e8 7c 19 e9 e0 <0f> 0b eb fe f6 05 17 91 20 e2 80 74 ce 80 3d 5f 2e 00 00 00 74
<1>[46267.088023] RIP [<ffffffffa01e00a4>] nf_nat_setup_info+0x564/0x590
Cc: Eric Dumazet <eric.dumazet@gmail.com>
Cc: Florian Westphal <fw@strlen.de>
Cc: Pablo Neira Ayuso <pablo@netfilter.org>
Cc: Patrick McHardy <kaber@trash.net>
Cc: Jozsef Kadlecsik <kadlec@blackhole.kfki.hu>
Cc: "David S. Miller" <davem@davemloft.net>
Cc: Cyrill Gorcunov <gorcunov@openvz.org>
Signed-off-by: Andrey Vagin <avagin@openvz.org>
Acked-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
The following commands trigger an oops:
# nft -i
nft> add table filter
nft> add chain filter input { type filter hook input priority 0; }
nft> add chain filter test
nft> add rule filter input jump test
nft> delete chain filter test
We need to check the chain use counter before allowing destruction since
we might have references from sets or jump rules.
Bugzilla: https://bugzilla.kernel.org/show_bug.cgi?id=69341
Reported-by: Matthew Ife <deleriux1@gmail.com>
Tested-by: Matthew Ife <deleriux1@gmail.com>
Signed-off-by: Patrick McHardy <kaber@trash.net>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
We want to make sure that the information that we get from the kernel can
be reinjected without troubles. The kernel shouldn't return an attribute
that is not required, or even prohibited.
Dumping unconditionally NFTA_CT_DIRECTION could lead an application in
userspace to interpret that the attribute was originally set, while it
was not.
Signed-off-by: Arturo Borrero Gonzalez <arturo.borrero.glez@gmail.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
With subfacets, we'd expect megaflow updates message to carry
the original micro flow. If not, EINVAL is returned and kernel
logs an error message. Now that the user space subfacet layer is
removed, it is expected that flow updates can arrive with a
micro flow other than the original. Change the return code to
EEXIST and remove the kernel error log message.
Reported-by: Ben Pfaff <blp@nicira.com>
Signed-off-by: Andy Zhou <azhou@nicira.com>
Signed-off-by: Jesse Gross <jesse@nicira.com>
ovs_flow_free() is not called under ovs-lock during packet
execute path (ovs_packet_cmd_execute()). Since packet execute
does not touch flow->mask, there is no need to take that
lock either. So move assert in case where flow->mask is checked.
Found by code inspection.
Signed-off-by: Pravin B Shelar <pshelar@nicira.com>
Signed-off-by: Jesse Gross <jesse@nicira.com>
commit 43d4be9cb5 (openvswitch: Allow user space
to announce ability to accept unaligned Netlink messages) introduced
OVS_DP_ATTR_USER_FEATURES netlink attribute in datapath responses,
but the attribute size was not taken into account in ovs_dp_cmd_msg_size().
Signed-off-by: Daniele Di Proietto <daniele.di.proietto@gmail.com>
Signed-off-by: Jesse Gross <jesse@nicira.com>
Both mega flow mask's reference counter and per flow table mask list
should only be accessed when holding ovs_mutex() lock. However
this is not true with ovs_flow_table_flush(). The patch fixes this bug.
Reported-by: Joe Stringer <joestringer@nicira.com>
Signed-off-by: Andy Zhou <azhou@nicira.com>
Signed-off-by: Jesse Gross <jesse@nicira.com>
While the zerocopy method is correctly omitted if user space
does not support unaligned Netlink messages. The attribute is
still not padded correctly as skb_zerocopy() will not ensure
padding and the attribute size is no longer pre calculated
though nla_reserve() which ensured padding previously.
This patch applies appropriate padding if a linear data copy
was performed in skb_zerocopy().
Signed-off-by: Thomas Graf <tgraf@suug.ch>
Acked-by: Zoltan Kiss <zoltan.kiss@citrix.com>
Signed-off-by: Jesse Gross <jesse@nicira.com>
Backend drivers shouldn't transistion to CLOSED unless the frontend is
CLOSED. If a backend does transition to CLOSED too soon then the
frontend may not see the CLOSING state and will not properly shutdown.
So, treat an unexpected backend CLOSED state the same as CLOSING.
Signed-off-by: David Vrabel <david.vrabel@citrix.com>
Acked-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
When configuring GRE tunnel using OVS, tcp stream is distributed over
all RSS queues which may cause TCP reordering. It happens since OVS
uses L2GRE protocol when kernel gre uses IPGRE.
Patch defaults gre tunnel to L2GRE which allows proper RSS for L2GRE
packets and (implicitly) disables RSS for IPGRE traffic.
Signed-off-by: Dmitry Kravkov <dmitry@broadcom.com>
Signed-off-by: Ariel Elior <ariele@broadcom.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
This device was mentioned in an OpenWRT forum. Seems to have a "standard"
Sierra Wireless ifnumber to function layout:
0: qcdm
2: nmea
3: modem
8: qmi
9: storage
Signed-off-by: Bjørn Mork <bjorn@mork.no>
Signed-off-by: David S. Miller <davem@davemloft.net>
Currently the trigger/event is hardcoded to 0, this patch adds
a new command line argument -i to select an arbitrary trigger/
event.
Signed-off-by: Stefan Sørensen <stefan.sorensen@spectralink.com>
Acked-by: Richard Cochran <richardcochran@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
MII management bus clock is derived from the MAC clock by dividing it by
MIIMODER register CLKDIV field value. This value may need to be set up
in case it is undefined or its default value is too high (and
communication with PHY is too slow) or too low (and communication with
PHY is impossible). The value of CLKDIV is not specified directly, but
is derived from the MAC clock for the default MII management bus frequency
of 2.5MHz. The MAC clock may be specified in the platform data, or in
the 'clocks' device tree attribute.
Signed-off-by: Max Filippov <jcmvbkbc@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>