The backport of d35c99ff77ec ("netlink: do not enter direct reclaim from
netlink_dump()") to the 4.4 branch (first in 4.4.32) mistakenly removed
direct claim from the initial large allocation _and_ the fallback
allocation which means that allocations can spuriously fail.
Fix the issue by adding back the direct reclaim flag to the fallback
allocation.
Fixes: 6d123f1d39 ("netlink: do not enter direct reclaim from netlink_dump()")
Signed-off-by: Ross Lagerwall <ross.lagerwall@citrix.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
commit 71d6ad08379304128e4bdfaf0b4185d54375423e upstream.
Don't assume that server is sane and won't return more data than
asked for.
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
[ Upstream commit 105f5528b9bbaa08b526d3405a5bcd2ff0c953c8 ]
In situations where an skb is paged, the transport header pointer and
tail pointer can be the same because the skb contents are in frags.
This results in ioctl(SIOCINQ/FIONREAD) incorrectly returning a
length of 0 when the length to receive is actually greater than zero.
skb->len is already correctly set in ip6_input_finish() with
pskb_pull(), so use skb->len as it always returns the correct result
for both linear and paged data.
Signed-off-by: Jamie Bainbridge <jbainbri@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
[ Upstream commit 199ab00f3cdb6f154ea93fa76fd80192861a821d ]
Andrey reported a out-of-bound access in ip6_tnl_xmit(), this
is because we use an ipv4 dst in ip6_tnl_xmit() and cast an IPv4
neigh key as an IPv6 address:
neigh = dst_neigh_lookup(skb_dst(skb),
&ipv6_hdr(skb)->daddr);
if (!neigh)
goto tx_err_link_failure;
addr6 = (struct in6_addr *)&neigh->primary_key; // <=== HERE
addr_type = ipv6_addr_type(addr6);
if (addr_type == IPV6_ADDR_ANY)
addr6 = &ipv6_hdr(skb)->daddr;
memcpy(&fl6->daddr, addr6, sizeof(fl6->daddr));
Also the network header of the skb at this point should be still IPv4
for 4in6 tunnels, we shold not just use it as IPv6 header.
This patch fixes it by checking if skb->protocol is ETH_P_IPV6: if it
is, we are safe to do the nexthop lookup using skb_dst() and
ipv6_hdr(skb)->daddr; if not (aka IPv4), we have no clue about which
dest address we can pick here, we have to rely on callers to fill it
from tunnel config, so just fall to ip6_route_output() to make the
decision.
Fixes: ea3dc9601b ("ip6_tunnel: Add support for wildcard tunnel endpoints.")
Reported-by: Andrey Konovalov <andreyknvl@google.com>
Tested-by: Andrey Konovalov <andreyknvl@google.com>
Cc: Steffen Klassert <steffen.klassert@secunet.com>
Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
[ Upstream commit 557c44be917c322860665be3d28376afa84aa936 ]
Andrey reported a fault in the IPv6 route code:
kasan: GPF could be caused by NULL-ptr deref or user memory access
general protection fault: 0000 [#1] SMP KASAN
Modules linked in:
CPU: 1 PID: 4035 Comm: a.out Not tainted 4.11.0-rc7+ #250
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS Bochs 01/01/2011
task: ffff880069809600 task.stack: ffff880062dc8000
RIP: 0010:ip6_rt_cache_alloc+0xa6/0x560 net/ipv6/route.c:975
RSP: 0018:ffff880062dced30 EFLAGS: 00010206
RAX: dffffc0000000000 RBX: ffff8800670561c0 RCX: 0000000000000006
RDX: 0000000000000003 RSI: ffff880062dcfb28 RDI: 0000000000000018
RBP: ffff880062dced68 R08: 0000000000000001 R09: 0000000000000000
R10: 0000000000000000 R11: 0000000000000000 R12: 0000000000000000
R13: ffff880062dcfb28 R14: dffffc0000000000 R15: 0000000000000000
FS: 00007feebe37e7c0(0000) GS:ffff88006cb00000(0000) knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 00000000205a0fe4 CR3: 000000006b5c9000 CR4: 00000000000006e0
Call Trace:
ip6_pol_route+0x1512/0x1f20 net/ipv6/route.c:1128
ip6_pol_route_output+0x4c/0x60 net/ipv6/route.c:1212
...
Andrey's syzkaller program passes rtmsg.rtmsg_flags with the RTF_PCPU bit
set. Flags passed to the kernel are blindly copied to the allocated
rt6_info by ip6_route_info_create making a newly inserted route appear
as though it is a per-cpu route. ip6_rt_cache_alloc sees the flag set
and expects rt->dst.from to be set - which it is not since it is not
really a per-cpu copy. The subsequent call to __ip6_dst_alloc then
generates the fault.
Fix by checking for the flag and failing with EINVAL.
Fixes: d52d3997f8 ("ipv6: Create percpu rt6_info")
Reported-by: Andrey Konovalov <andreyknvl@google.com>
Signed-off-by: David Ahern <dsa@cumulusnetworks.com>
Acked-by: Martin KaFai Lau <kafai@fb.com>
Tested-by: Andrey Konovalov <andreyknvl@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
[ Upstream commit 34b2789f1d9bf8dcca9b5cb553d076ca2cd898ee ]
Now sctp doesn't check sock's state before listening on it. It could
even cause changing a sock with any state to become a listening sock
when doing sctp_listen.
This patch is to fix it by checking sock's state in sctp_listen, so
that it will listen on the sock with right state.
Reported-by: Andrey Konovalov <andreyknvl@google.com>
Tested-by: Andrey Konovalov <andreyknvl@google.com>
Signed-off-by: Xin Long <lucien.xin@gmail.com>
Acked-by: Marcelo Ricardo Leitner <marcelo.leitner@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
[ Upstream commit a8801799c6975601fd58ae62f48964caec2eb83f ]
inet_rtm_getroute synthesizes a skeletal ICMP skb, which is passed to
ip_route_input when iif is given. If a multipath route is present for
the designated destination, ip_multipath_icmp_hash ends up being called,
which uses the source/destination addresses within the skb to calculate
a hash. However, those are not set in the synthetic skb, causing it to
return an arbitrary and incorrect result.
Instead, use UDP, which gets no such special treatment.
Signed-off-by: Florian Larysch <fl@n621.de>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
[ Upstream commit 249ee819e24c180909f43c1173c8ef6724d21faf ]
PPP pseudo-wire type is 7 (11 is L2TP_PWTYPE_IP).
Fixes: f1f39f9110 ("l2tp: auto load type modules")
Signed-off-by: Guillaume Nault <g.nault@alphalink.fr>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
[ Upstream commit e08293a4ccbcc993ded0fdc46f1e57926b833d63 ]
Take a reference on the sessions returned by l2tp_session_find_nth()
(and rename it l2tp_session_get_nth() to reflect this change), so that
caller is assured that the session isn't going to disappear while
processing it.
For procfs and debugfs handlers, the session is held in the .start()
callback and dropped in .show(). Given that pppol2tp_seq_session_show()
dereferences the associated PPPoL2TP socket and that
l2tp_dfs_seq_session_show() might call pppol2tp_show(), we also need to
call the session's .ref() callback to prevent the socket from going
away from under us.
Fixes: fd558d186d ("l2tp: Split pppol2tp patch into separate l2tp and ppp parts")
Fixes: 0ad6614048 ("l2tp: Add debugfs files for dumping l2tp debug info")
Fixes: 309795f4be ("l2tp: Add netlink control API for L2TP")
Signed-off-by: Guillaume Nault <g.nault@alphalink.fr>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
[ Upstream commit bcc5364bdcfe131e6379363f089e7b4108d35b70 ]
When calculating po->tp_hdrlen + po->tp_reserve the result can overflow.
Fix by checking that tp_reserve <= INT_MAX on assign.
Signed-off-by: Andrey Konovalov <andreyknvl@google.com>
Acked-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
[ Upstream commit 8f8d28e4d6d815a391285e121c3a53a0b6cb9e7b ]
When calculating rb->frames_per_block * req->tp_block_nr the result
can overflow.
Add a check that tp_block_size * tp_block_nr <= UINT_MAX.
Since frames_per_block <= tp_block_size, the expression would
never overflow.
Signed-off-by: Andrey Konovalov <andreyknvl@google.com>
Acked-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
[ Upstream commit e91793bb615cf6cdd59c0b6749fe173687bb0947 ]
The Rx path may grab the socket right before pppol2tp_release(), but
nothing guarantees that it will enqueue packets before
skb_queue_purge(). Therefore, the socket can be destroyed without its
queues fully purged.
Fix this by purging queues in pppol2tp_session_destruct() where we're
guaranteed nothing is still referencing the socket.
Fixes: 9e9cb6221a ("l2tp: fix userspace reception on plain L2TP sockets")
Signed-off-by: Guillaume Nault <g.nault@alphalink.fr>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
[ Upstream commit 48481c8fa16410ffa45939b13b6c53c2ca609e5f ]
Dmitry posted a nice reproducer of a bug triggering in neigh_probe()
when dereferencing a NULL neigh->ops->solicit method.
This can happen for arp_direct_ops/ndisc_direct_ops and similar,
which can be used for NUD_NOARP neighbours (created when dev->header_ops
is NULL). Admin can then force changing nud_state to some other state
that would fire neigh timer.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Reported-by: Dmitry Vyukov <dvyukov@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
commit e47db94e10447fc467777a40302f2b393e9af2fa upstream.
Two different threads with different rds sockets may be in
rds_recv_rcvbuf_delta() via receive path. If their ports
both map to the same word in the congestion map, then
using non-atomic ops to update it could cause the map to
be incorrect. Lets use atomics to avoid such an issue.
Full credit to Wengang <wen.gang.wang@oracle.com> for
finding the issue, analysing it and also pointing out
to offending code with spin lock based fix.
Reviewed-by: Leon Romanovsky <leon@leon.nu>
Signed-off-by: Wengang Wang <wen.gang.wang@oracle.com>
Signed-off-by: Santosh Shilimkar <santosh.shilimkar@oracle.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Cc: Julia Lawall <julia.lawall@lip6.fr>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
commit dc327f8931cb9d66191f489eb9a852fc04530546 upstream.
We saw the following extra refcount release on veth device:
kernel: [7957821.463992] unregister_netdevice: waiting for mesos50284 to become free. Usage count = -1
Since we heavily use mirred action to redirect packets to veth, I think
this is caused by the following race condition:
CPU0:
tcf_mirred_release(): (in RCU callback)
struct net_device *dev = rcu_dereference_protected(m->tcfm_dev, 1);
CPU1:
mirred_device_event():
spin_lock_bh(&mirred_list_lock);
list_for_each_entry(m, &mirred_list, tcfm_list) {
if (rcu_access_pointer(m->tcfm_dev) == dev) {
dev_put(dev);
/* Note : no rcu grace period necessary, as
* net_device are already rcu protected.
*/
RCU_INIT_POINTER(m->tcfm_dev, NULL);
}
}
spin_unlock_bh(&mirred_list_lock);
CPU0:
tcf_mirred_release():
spin_lock_bh(&mirred_list_lock);
list_del(&m->tcfm_list);
spin_unlock_bh(&mirred_list_lock);
if (dev) // <======== Stil refers to the old m->tcfm_dev
dev_put(dev); // <======== dev_put() is called on it again
The action init code path is good because it is impossible to modify
an action that is being removed.
So, fix this by moving everything under the spinlock.
Fixes: 2ee22a90c7 ("net_sched: act_mirred: remove spinlock in fast path")
Fixes: 6bd00b8506 ("act_mirred: fix a race condition on mirred_list")
Cc: Jamal Hadi Salim <jhs@mojatatu.com>
Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com>
Acked-by: Jamal Hadi Salim <jhs@mojatatu.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Cc: Julia Lawall <julia.lawall@lip6.fr>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
commit 43a6684519ab0a6c52024b5e25322476cabad893 upstream.
We got a report of yet another bug in ping
http://www.openwall.com/lists/oss-security/2017/03/24/6
->disconnect() is not called with socket lock held.
Fix this by acquiring ping rwlock earlier.
Thanks to Daniel, Alexander and Andrey for letting us know this problem.
Fixes: c319b4d76b ("net: ipv4: add IPPROTO_ICMP socket kind")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Reported-by: Daniel Jiang <danieljiang0415@gmail.com>
Reported-by: Solar Designer <solar@openwall.com>
Reported-by: Andrey Konovalov <andreyknvl@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Cc: Ben Hutchings <ben.hutchings@codethink.co.uk>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
commit 3de81b758853f0b29c61e246679d20b513c4cfec upstream.
Qian Zhang (张谦) reported a potential socket buffer overflow in
tipc_msg_build() which is also known as CVE-2016-8632: due to
insufficient checks, a buffer overflow can occur if MTU is too short for
even tipc headers. As anyone can set device MTU in a user/net namespace,
this issue can be abused by a regular user.
As agreed in the discussion on Ben Hutchings' original patch, we should
check the MTU at the moment a bearer is attached rather than for each
processed packet. We also need to repeat the check when bearer MTU is
adjusted to new device MTU. UDP case also needs a check to avoid
overflow when calculating bearer MTU.
Fixes: b97bf3fd8f ("[TIPC] Initial merge")
Signed-off-by: Michal Kubecek <mkubecek@suse.cz>
Reported-by: Qian Zhang (张谦) <zhangqian-c@360.cn>
Acked-by: Ying Xue <ying.xue@windriver.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
[bwh: Backported to 4.4:
- Adjust context
- NETDEV_GOING_DOWN and NETDEV_CHANGEMTU cases in net notifier were combined]
Signed-off-by: Ben Hutchings <ben.hutchings@codethink.co.uk>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
commit c58d6c93680f28ac58984af61d0a7ebf4319c241 upstream.
If nlh->nlmsg_len is zero then an infinite loop is triggered because
'skb_pull(skb, msglen);' pulls zero bytes.
The calculation in nlmsg_len() underflows if 'nlh->nlmsg_len <
NLMSG_HDRLEN' which bypasses the length validation and will later
trigger an out-of-bound read.
If the length validation does fail then the malformed batch message is
copied back to userspace. However, we cannot do this because the
nlh->nlmsg_len can be invalid. This leads to an out-of-bounds read in
netlink_ack:
[ 41.455421] ==================================================================
[ 41.456431] BUG: KASAN: slab-out-of-bounds in memcpy+0x1d/0x40 at addr ffff880119e79340
[ 41.456431] Read of size 4294967280 by task a.out/987
[ 41.456431] =============================================================================
[ 41.456431] BUG kmalloc-512 (Not tainted): kasan: bad access detected
[ 41.456431] -----------------------------------------------------------------------------
...
[ 41.456431] Bytes b4 ffff880119e79310: 00 00 00 00 d5 03 00 00 b0 fb fe ff 00 00 00 00 ................
[ 41.456431] Object ffff880119e79320: 20 00 00 00 10 00 05 00 00 00 00 00 00 00 00 00 ...............
[ 41.456431] Object ffff880119e79330: 14 00 0a 00 01 03 fc 40 45 56 11 22 33 10 00 05 .......@EV."3...
[ 41.456431] Object ffff880119e79340: f0 ff ff ff 88 99 aa bb 00 14 00 0a 00 06 fe fb ................
^^ start of batch nlmsg with
nlmsg_len=4294967280
...
[ 41.456431] Memory state around the buggy address:
[ 41.456431] ffff880119e79400: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
[ 41.456431] ffff880119e79480: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
[ 41.456431] >ffff880119e79500: 00 00 00 00 fc fc fc fc fc fc fc fc fc fc fc fc
[ 41.456431] ^
[ 41.456431] ffff880119e79580: fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc
[ 41.456431] ffff880119e79600: fc fc fc fc fc fc fc fc fc fc fb fb fb fb fb fb
[ 41.456431] ==================================================================
Fix this with better validation of nlh->nlmsg_len and by setting
NFNL_BATCH_FAILURE if any batch message fails length validation.
CAP_NET_ADMIN is required to trigger the bugs.
Fixes: 9ea2aa8b7d ("netfilter: nfnetlink: validate nfnetlink header from batch")
Signed-off-by: Phil Turnbull <phil.turnbull@oracle.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Cc: Ben Hutchings <ben.hutchings@codethink.co.uk>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
commit f1d048f24e66ba85d3dabf3d076cefa5f2b546b0 upstream.
We sometimes observe a 'deadly embrace' type deadlock occurring
between mutually connected sockets on the same node. This happens
when the one-hour peer supervision timers happen to expire
simultaneously in both sockets.
The scenario is as follows:
CPU 1: CPU 2:
-------- --------
tipc_sk_timeout(sk1) tipc_sk_timeout(sk2)
lock(sk1.slock) lock(sk2.slock)
msg_create(probe) msg_create(probe)
unlock(sk1.slock) unlock(sk2.slock)
tipc_node_xmit_skb() tipc_node_xmit_skb()
tipc_node_xmit() tipc_node_xmit()
tipc_sk_rcv(sk2) tipc_sk_rcv(sk1)
lock(sk2.slock) lock((sk1.slock)
filter_rcv() filter_rcv()
tipc_sk_proto_rcv() tipc_sk_proto_rcv()
msg_create(probe_rsp) msg_create(probe_rsp)
tipc_sk_respond() tipc_sk_respond()
tipc_node_xmit_skb() tipc_node_xmit_skb()
tipc_node_xmit() tipc_node_xmit()
tipc_sk_rcv(sk1) tipc_sk_rcv(sk2)
lock((sk1.slock) lock((sk2.slock)
===> DEADLOCK ===> DEADLOCK
Further analysis reveals that there are three different locations in the
socket code where tipc_sk_respond() is called within the context of the
socket lock, with ensuing risk of similar deadlocks.
We now solve this by passing a buffer queue along with all upcalls where
sk_lock.slock may potentially be held. Response or rejected message
buffers are accumulated into this queue instead of being sent out
directly, and only sent once we know we are safely outside the slock
context.
Reported-by: GUNA <gbalasun@gmail.com>
Acked-by: Ying Xue <ying.xue@windriver.com>
Signed-off-by: Jon Maloy <jon.maloy@ericsson.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
commit d2f394dc4816b7bd1b44981d83509f18f19c53f0 upstream.
In a dual bearer configuration, if the second tipc link becomes
active while the first link still has pending nametable "bulk"
updates, it randomly leads to reset of the second link.
When a link is established, the function named_distribute(),
fills the skb based on node mtu (allows room for TUNNEL_PROTOCOL)
with NAME_DISTRIBUTOR message for each PUBLICATION.
However, the function named_distribute() allocates the buffer by
increasing the node mtu by INT_H_SIZE (to insert NAME_DISTRIBUTOR).
This consumes the space allocated for TUNNEL_PROTOCOL.
When establishing the second link, the link shall tunnel all the
messages in the first link queue including the "bulk" update.
As size of the NAME_DISTRIBUTOR messages while tunnelling, exceeds
the link mtu the transmission fails (-EMSGSIZE).
Thus, the synch point based on the message count of the tunnel
packets is never reached leading to link timeout.
In this commit, we adjust the size of name distributor message so that
they can be tunnelled.
Reviewed-by: Jon Maloy <jon.maloy@ericsson.com>
Signed-off-by: Parthasarathy Bhuvaragan <parthasarathy.bhuvaragan@ericsson.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
commit c4282ca76c5b81ed73ef4c5eb5c07ee397e51642 upstream.
commit 88e8ac7000dc ("tipc: reduce transmission rate of reset messages
when link is down") revealed a flaw in the node FSM, as defined in
the log of commit 66996b6c47 ("tipc: extend node FSM").
We see the following scenario:
1: Node B receives a RESET message from node A before its link endpoint
is fully up, i.e., the node FSM is in state SELF_UP_PEER_COMING. This
event will not change the node FSM state, but the (distinct) link FSM
will move to state RESETTING.
2: As an effect of the previous event, the local endpoint on B will
declare node A lost, and post the event SELF_DOWN to the its node
FSM. This moves the FSM state to SELF_DOWN_PEER_LEAVING, meaning
that no messages will be accepted from A until it receives another
RESET message that confirms that A's endpoint has been reset. This
is wasteful, since we know this as a fact already from the first
received RESET, but worse is that the link instance's FSM has not
wasted this information, but instead moved on to state ESTABLISHING,
meaning that it repeatedly sends out ACTIVATE messages to the reset
peer A.
3: Node A will receive one of the ACTIVATE messages, move its link FSM
to state ESTABLISHED, and start repeatedly sending out STATE messages
to node B.
4: Node B will consistently drop these messages, since it can only accept
accept a RESET according to its node FSM.
5: After four lost STATE messages node A will reset its link and start
repeatedly sending out RESET messages to B.
6: Because of the reduced send rate for RESET messages, it is very
likely that A will receive an ACTIVATE (which is sent out at a much
higher frequency) before it gets the chance to send a RESET, and A
may hence quickly move back to state ESTABLISHED and continue sending
out STATE messages, which will again be dropped by B.
7: GOTO 5.
8: After having repeated the cycle 5-7 a number of times, node A will
by chance get in between with sending a RESET, and the situation is
resolved.
Unfortunately, we have seen that it may take a substantial amount of
time before this vicious loop is broken, sometimes in the order of
minutes.
We correct this by making a small correction to the node FSM: When a
node in state SELF_UP_PEER_COMING receives a SELF_DOWN event, it now
moves directly back to state SELF_DOWN_PEER_DOWN, instead of as now
SELF_DOWN_PEER_LEAVING. This is logically consistent, since we don't
need to wait for RESET confirmation from of an endpoint that we alread
know has been reset. It also means that node B in the scenario above
will not be dropping incoming STATE messages, and the link can come up
immediately.
Finally, a symmetry comparison reveals that the FSM has a similar
error when receiving the event PEER_DOWN in state PEER_UP_SELF_COMING.
Instead of moving to PERR_DOWN_SELF_LEAVING, it should move directly
to SELF_DOWN_PEER_DOWN. Although we have never seen any negative effect
of this logical error, we choose fix this one, too.
The node FSM looks as follows after those changes:
+----------------------------------------+
| PEER_DOWN_EVT|
| |
+------------------------+----------------+ |
|SELF_DOWN_EVT | | |
| | | |
| +-----------+ +-----------+ |
| |NODE_ | |NODE_ | |
| +----------|FAILINGOVER|<---------|SYNCHING |-----------+ |
| |SELF_ +-----------+ FAILOVER_+-----------+ PEER_ | |
| |DOWN_EVT | A BEGIN_EVT A | DOWN_EVT| |
| | | | | | | |
| | | | | | | |
| | |FAILOVER_ |FAILOVER_ |SYNCH_ |SYNCH_ | |
| | |END_EVT |BEGIN_EVT |BEGIN_EVT|END_EVT | |
| | | | | | | |
| | | | | | | |
| | | +--------------+ | | |
| | +-------->| SELF_UP_ |<-------+ | |
| | +-----------------| PEER_UP |----------------+ | |
| | |SELF_DOWN_EVT +--------------+ PEER_DOWN_EVT| | |
| | | A A | | |
| | | | | | | |
| | | PEER_UP_EVT| |SELF_UP_EVT | | |
| | | | | | | |
V V V | | V V V
+------------+ +-----------+ +-----------+ +------------+
|SELF_DOWN_ | |SELF_UP_ | |PEER_UP_ | |PEER_DOWN |
|PEER_LEAVING| |PEER_COMING| |SELF_COMING| |SELF_LEAVING|
+------------+ +-----------+ +-----------+ +------------+
| | A A | |
| | | | | |
| SELF_ | |SELF_ |PEER_ |PEER_ |
| DOWN_EVT| |UP_EVT |UP_EVT |DOWN_EVT |
| | | | | |
| | | | | |
| | +--------------+ | |
|PEER_DOWN_EVT +--->| SELF_DOWN_ |<---+ SELF_DOWN_EVT|
+------------------->| PEER_DOWN |<--------------------+
+--------------+
Acked-by: Ying Xue <ying.xue@windriver.com>
Signed-off-by: Jon Maloy <jon.maloy@ericsson.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
commit 7c8bcfb1255fe9d929c227d67bdcd84430fd200b upstream.
In the refactoring commit d570d86497 ("tipc: enqueue arrived buffers
in socket in separate function") we did by accident replace the test
if (sk->sk_backlog.len == 0)
atomic_set(&tsk->dupl_rcvcnt, 0);
with
if (sk->sk_backlog.len)
atomic_set(&tsk->dupl_rcvcnt, 0);
This effectively disables the compensation we have for the double
receive buffer accounting that occurs temporarily when buffers are
moved from the backlog to the socket receive queue. Until now, this
has gone unnoticed because of the large receive buffer limits we are
applying, but becomes indispensable when we reduce this buffer limit
later in this series.
We now fix this by inverting the mentioned condition.
Acked-by: Ying Xue <ying.xue@windriver.com>
Signed-off-by: Jon Maloy <jon.maloy@ericsson.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
commit 541726abe7daca64390c2ec34e6a203145f1686d upstream.
Nametable updates received from the network that cannot be applied
immediately are placed on a defer queue. This queue is global to the
TIPC module, which might cause problems when using TIPC in containers.
To prevent nametable updates from escaping into the wrong namespace,
we make the queue pernet instead.
Signed-off-by: Erik Hugne <erik.hugne@gmail.com>
Signed-off-by: Jon Maloy <jon.maloy@ericsson.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
commit 9bd160bfa27fa41927dbbce7ee0ea779700e09ef upstream.
Expand headroom further in order to be able to fit the larger IPv6
header. Prior to this patch this caused a skb under panic for certain
tipc packets when using IPv6 UDP bearer(s).
Signed-off-by: Richard Alpe <richard.alpe@ericsson.com>
Acked-by: Jon Maloy <jon.maloy@ericsson.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Jon Maloy <jon.maloy@ericsson.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
commit d25a01257e422a4bdeb426f69529d57c73b235fe upstream.
When the TIPC module is unloaded, we have identified a race condition
that allows a node reference counter to go to zero and the node instance
being freed before the node timer is finished with accessing it. This
leads to occasional crashes, especially in multi-namespace environments.
The scenario goes as follows:
CPU0:(node_stop) CPU1:(node_timeout) // ref == 2
1: if(!mod_timer())
2: if (del_timer())
3: tipc_node_put() // ref -> 1
4: tipc_node_put() // ref -> 0
5: kfree_rcu(node);
6: tipc_node_get(node)
7: // BOOM!
We now clean up this functionality as follows:
1) We remove the node pointer from the node lookup table before we
attempt deactivating the timer. This way, we reduce the risk that
tipc_node_find() may obtain a valid pointer to an instance marked
for deletion; a harmless but undesirable situation.
2) We use del_timer_sync() instead of del_timer() to safely deactivate
the node timer without any risk that it might be reactivated by the
timeout handler. There is no risk of deadlock here, since the two
functions never touch the same spinlocks.
3: We remove a pointless tipc_node_get() + tipc_node_put() from the
timeout handler.
Reported-by: Zhijiang Hu <huzhijiang@gmail.com>
Acked-by: Ying Xue <ying.xue@windriver.com>
Signed-off-by: Jon Maloy <jon.maloy@ericsson.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
commit 3018e947d7fd536d57e2b550c33e456d921fff8c upstream.
AP/AP_VLAN modes don't accept any real 802.11 multicast data
frames, but since they do need to accept broadcast management
frames the same is currently permitted for data frames. This
opens a security problem because such frames would be decrypted
with the GTK, and could even contain unicast L3 frames.
Since the spec says that ToDS frames must always have the BSSID
as the RA (addr1), reject any other data frames.
The problem was originally reported in "Predicting, Decrypting,
and Abusing WPA2/802.11 Group Keys" at usenix
https://www.usenix.org/conference/usenixsecurity16/technical-sessions/presentation/vanhoef
and brought to my attention by Jouni.
Reported-by: Jouni Malinen <j@w1.fi>
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
--
commit 8ab18d71de8b07d2c4d6f984b718418c09ea45c5 upstream.
The check in vmci_transport_peer_detach_cb should only allow a
detach when the qp handle of the transport matches the one in
the detach message.
Testing: Before this change, a detach from a peer on a different
socket would cause an active stream socket to register a detach.
Reviewed-by: George Zhang <georgezhang@vmware.com>
Signed-off-by: Jorgen Hansen <jhansen@vmware.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
commit dfcb9f4f99f1e9a49e43398a7bfbf56927544af1 upstream.
commit 2dcab5984841 ("sctp: avoid BUG_ON on sctp_wait_for_sndbuf")
attempted to avoid a BUG_ON call when the association being used for a
sendmsg() is blocked waiting for more sndbuf and another thread did a
peeloff operation on such asoc, moving it to another socket.
As Ben Hutchings noticed, then in such case it would return without
locking back the socket and would cause two unlocks in a row.
Further analysis also revealed that it could allow a double free if the
application managed to peeloff the asoc that is created during the
sendmsg call, because then sctp_sendmsg() would try to free the asoc
that was created only for that call.
This patch takes another approach. It will deny the peeloff operation
if there is a thread sleeping on the asoc, so this situation doesn't
exist anymore. This avoids the issues described above and also honors
the syscalls that are already being handled (it can be multiple sendmsg
calls).
Joint work with Xin Long.
Fixes: 2dcab5984841 ("sctp: avoid BUG_ON on sctp_wait_for_sndbuf")
Cc: Alexander Popov <alex.popov@linux.com>
Cc: Ben Hutchings <ben@decadent.org.uk>
Signed-off-by: Marcelo Ricardo Leitner <marcelo.leitner@gmail.com>
Signed-off-by: Xin Long <lucien.xin@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
commit c2ed1880fd61a998e3ce40254a99a2ad000f1a7d upstream.
The protocol field is checked when deleting IPv4 routes, but ignored for
IPv6, which causes problems with routing daemons accidentally deleting
externally set routes (observed by multiple bird6 users).
This can be verified using `ip -6 route del <prefix> proto something`.
Signed-off-by: Mantas Mikulėnas <grawity@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Cc: Ben Hutchings <ben@decadent.org.uk>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
commit 1cded9d2974fe4fe339fc0ccd6638b80d465ab2c upstream.
There are two problems with refcounting of auth_gss messages.
First, the reference on the pipe->pipe list (taken by a call
to rpc_queue_upcall()) is not counted. It seems to be
assumed that a message in pipe->pipe will always also be in
pipe->in_downcall, where it is correctly reference counted.
However there is no guaranty of this. I have a report of a
NULL dereferences in rpc_pipe_read() which suggests a msg
that has been freed is still on the pipe->pipe list.
One way I imagine this might happen is:
- message is queued for uid=U and auth->service=S1
- rpc.gssd reads this message and starts processing.
This removes the message from pipe->pipe
- message is queued for uid=U and auth->service=S2
- rpc.gssd replies to the first message. gss_pipe_downcall()
calls __gss_find_upcall(pipe, U, NULL) and it finds the
*second* message, as new messages are placed at the head
of ->in_downcall, and the service type is not checked.
- This second message is removed from ->in_downcall and freed
by gss_release_msg() (even though it is still on pipe->pipe)
- rpc.gssd tries to read another message, and dereferences a pointer
to this message that has just been freed.
I fix this by incrementing the reference count before calling
rpc_queue_upcall(), and decrementing it if that fails, or normally in
gss_pipe_destroy_msg().
It seems strange that the reply doesn't target the message more
precisely, but I don't know all the details. In any case, I think the
reference counting irregularity became a measureable bug when the
extra arg was added to __gss_find_upcall(), hence the Fixes: line
below.
The second problem is that if rpc_queue_upcall() fails, the new
message is not freed. gss_alloc_msg() set the ->count to 1,
gss_add_msg() increments this to 2, gss_unhash_msg() decrements to 1,
then the pointer is discarded so the memory never gets freed.
Fixes: 9130b8dbc6ac ("SUNRPC: allow for upcalls for same uid but different gss service")
Link: https://bugzilla.opensuse.org/show_bug.cgi?id=1011250
Signed-off-by: NeilBrown <neilb@suse.com>
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
Signed-off-by: Sumit Semwal <sumit.semwal@linaro.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
commit 2b6867c2ce76c596676bec7d2d525af525fdc6e2 upstream.
Subtracting tp_sizeof_priv from tp_block_size and casting to int
to check whether one is less then the other doesn't always work
(both of them are unsigned ints).
Compare them as is instead.
Also cast tp_sizeof_priv to u64 before using BLK_PLUS_PRIV, as
it can overflow inside BLK_PLUS_PRIV otherwise.
Signed-off-by: Andrey Konovalov <andreyknvl@google.com>
Acked-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
commit f843ee6dd019bcece3e74e76ad9df0155655d0df upstream.
Kees Cook has pointed out that xfrm_replay_state_esn_len() is subject to
wrapping issues. To ensure we are correctly ensuring that the two ESN
structures are the same size compare both the overall size as reported
by xfrm_replay_state_esn_len() and the internal length are the same.
CVE-2017-7184
Signed-off-by: Andy Whitcroft <apw@canonical.com>
Acked-by: Steffen Klassert <steffen.klassert@secunet.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
commit 677e806da4d916052585301785d847c3b3e6186a upstream.
When a new xfrm state is created during an XFRM_MSG_NEWSA call we
validate the user supplied replay_esn to ensure that the size is valid
and to ensure that the replay_window size is within the allocated
buffer. However later it is possible to update this replay_esn via a
XFRM_MSG_NEWAE call. There we again validate the size of the supplied
buffer matches the existing state and if so inject the contents. We do
not at this point check that the replay_window is within the allocated
memory. This leads to out-of-bounds reads and writes triggered by
netlink packets. This leads to memory corruption and the potential for
priviledge escalation.
We already attempt to validate the incoming replay information in
xfrm_new_ae() via xfrm_replay_verify_len(). This confirms that the user
is not trying to change the size of the replay state buffer which
includes the replay_esn. It however does not check the replay_window
remains within that buffer. Add validation of the contained
replay_window.
CVE-2017-7184
Signed-off-by: Andy Whitcroft <apw@canonical.com>
Acked-by: Steffen Klassert <steffen.klassert@secunet.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
commit c282222a45cb9503cbfbebfdb60491f06ae84b49 upstream.
Dmitry reports following splat:
INFO: trying to register non-static key.
the code is fine but needs lockdep annotation.
turning off the locking correctness validator.
CPU: 0 PID: 13059 Comm: syz-executor1 Not tainted 4.10.0-rc7-next-20170207 #1
[..]
spin_lock_bh include/linux/spinlock.h:304 [inline]
xfrm_policy_flush+0x32/0x470 net/xfrm/xfrm_policy.c:963
xfrm_policy_fini+0xbf/0x560 net/xfrm/xfrm_policy.c:3041
xfrm_net_init+0x79f/0x9e0 net/xfrm/xfrm_policy.c:3091
ops_init+0x10a/0x530 net/core/net_namespace.c:115
setup_net+0x2ed/0x690 net/core/net_namespace.c:291
copy_net_ns+0x26c/0x530 net/core/net_namespace.c:396
create_new_namespaces+0x409/0x860 kernel/nsproxy.c:106
unshare_nsproxy_namespaces+0xae/0x1e0 kernel/nsproxy.c:205
SYSC_unshare kernel/fork.c:2281 [inline]
Problem is that when we get error during xfrm_net_init we will call
xfrm_policy_fini which will acquire xfrm_policy_lock before it was
initialized. Just move it around so locks get set up first.
Reported-by: Dmitry Vyukov <dvyukov@google.com>
Fixes: 283bc9f35b ("xfrm: Namespacify xfrm state/policy locks")
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
commit ea90e0dc8cecba6359b481e24d9c37160f6f524f upstream.
Sowmini pointed out Dmitry's RTNL deadlock report to me, and it turns out
to be perfectly accurate - there are various error paths that miss unlock
of the RTNL.
To fix those, change the locking a bit to not be conditional in all those
nl80211_prepare_*_dump() functions, but make those require the RTNL to
start with, and fix the buggy error paths. This also let me use sparse
(by appropriately overriding the rtnl_lock/rtnl_unlock functions) to
validate the changes.
Reported-by: Sowmini Varadhan <sowmini.varadhan@oracle.com>
Reported-by: Dmitry Vyukov <dvyukov@google.com>
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
commit b581a5854eee4b7851dedb0f8c2ceb54fb902c06 upstream.
Since ceph.git commit 4e28f9e63644 ("osd/OSDMap: clear osd_info,
osd_xinfo on osd deletion"), weight is set to IN when OSD is deleted.
This changes the result of applying an incremental for clients, not
just OSDs. Because CRUSH computations are obviously affected,
pre-4e28f9e63644 servers disagree with post-4e28f9e63644 clients on
object placement, resulting in misdirected requests.
Mirrors ceph.git commit a6009d1039a55e2c77f431662b3d6cc5a8e8e63f.
Fixes: 930c53286977 ("libceph: apply new_state before new_up_client on incrementals")
Link: http://tracker.ceph.com/issues/19122
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
Reviewed-by: Sage Weil <sage@redhat.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
[ Upstream commit 15bb7745e94a665caf42bfaabf0ce062845b533b ]
icsk_ack.lrcvtime has a 0 value at socket creation time.
tcpi_last_data_recv can have bogus value if no payload is ever received.
This patch initializes icsk_ack.lrcvtime for active sessions
in tcp_finish_connect(), and for passive sessions in
tcp_create_openreq_child()
Signed-off-by: Eric Dumazet <edumazet@google.com>
Acked-by: Neal Cardwell <ncardwell@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
[ Upstream commit a97e50cc4cb67e1e7bff56f6b41cda62ca832336 ]
In sk_clone_lock(), we create a new socket and inherit most of the
parent's members via sock_copy() which memcpy()'s various sections.
Now, in case the parent socket had a BPF socket filter attached,
then newsk->sk_filter points to the same instance as the original
sk->sk_filter.
sk_filter_charge() is then called on the newsk->sk_filter to take a
reference and should that fail due to hitting max optmem, we bail
out and release the newsk instance.
The issue is that commit 278571baca ("net: filter: simplify socket
charging") wrongly combined the dismantle path with the failure path
of xfrm_sk_clone_policy(). This means, even when charging failed, we
call sk_free_unlock_clone() on the newsk, which then still points to
the same sk_filter as the original sk.
Thus, sk_free_unlock_clone() calls into __sk_destruct() eventually
where it tests for present sk_filter and calls sk_filter_uncharge()
on it, which potentially lets sk_omem_alloc wrap around and releases
the eBPF prog and sk_filter structure from the (still intact) parent.
Fix it by making sure that when sk_filter_charge() failed, we reset
newsk->sk_filter back to NULL before passing to sk_free_unlock_clone(),
so that we don't mess with the parents sk_filter.
Only if xfrm_sk_clone_policy() fails, we did reach the point where
either the parent's filter was NULL and as a result newsk's as well
or where we previously had a successful sk_filter_charge(), thus for
that case, we do need sk_filter_uncharge() to release the prior taken
reference on sk_filter.
Fixes: 278571baca ("net: filter: simplify socket charging")
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
[ Upstream commit c64c0b3cac4c5b8cb093727d2c19743ea3965c0b ]
Alexander reported a KMSAN splat caused by reads of uninitialized
field (tb_id_in) from user provided struct fib_result_nl
It turns out nl_fib_input() sanity tests on user input is a bit
wrong :
User can pretend nlh->nlmsg_len is big enough, but provide
at sendmsg() time a too small buffer.
Reported-by: Alexander Potapenko <glider@google.com>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
[ Upstream commit 7df9c24625b9981779afb8fcdbe2bb4765e61147 ]
Dmitry has reported that a BUG_ON() condition in unix_notinflight()
may be triggered by a simple code that forwards unix socket in an
SCM_RIGHTS message.
That is caused by incorrect unix socket GC implementation in unix_gc().
The GC first collects list of candidates, then (a) decrements their
"children's" inflight counter, (b) checks which inflight counters are
now 0, and then (c) increments all inflight counters back.
(a) and (c) are done by calling scan_children() with inc_inflight or
dec_inflight as the second argument.
Commit 6209344f5a ("net: unix: fix inflight counting bug in garbage
collector") changed scan_children() such that it no longer considers
sockets that do not have UNIX_GC_CANDIDATE flag. It also added a block
of code that that unsets this flag _before_ invoking
scan_children(, dec_iflight, ). This may lead to incorrect inflight
counters for some sockets.
This change fixes this bug by changing order of operations:
UNIX_GC_CANDIDATE is now unset only after all inflight counters are
restored to the original state.
kernel BUG at net/unix/garbage.c:149!
RIP: 0010:[<ffffffff8717ebf4>] [<ffffffff8717ebf4>]
unix_notinflight+0x3b4/0x490 net/unix/garbage.c:149
Call Trace:
[<ffffffff8716cfbf>] unix_detach_fds.isra.19+0xff/0x170 net/unix/af_unix.c:1487
[<ffffffff8716f6a9>] unix_destruct_scm+0xf9/0x210 net/unix/af_unix.c:1496
[<ffffffff86a90a01>] skb_release_head_state+0x101/0x200 net/core/skbuff.c:655
[<ffffffff86a9808a>] skb_release_all+0x1a/0x60 net/core/skbuff.c:668
[<ffffffff86a980ea>] __kfree_skb+0x1a/0x30 net/core/skbuff.c:684
[<ffffffff86a98284>] kfree_skb+0x184/0x570 net/core/skbuff.c:705
[<ffffffff871789d5>] unix_release_sock+0x5b5/0xbd0 net/unix/af_unix.c:559
[<ffffffff87179039>] unix_release+0x49/0x90 net/unix/af_unix.c:836
[<ffffffff86a694b2>] sock_release+0x92/0x1f0 net/socket.c:570
[<ffffffff86a6962b>] sock_close+0x1b/0x20 net/socket.c:1017
[<ffffffff81a76b8e>] __fput+0x34e/0x910 fs/file_table.c:208
[<ffffffff81a771da>] ____fput+0x1a/0x20 fs/file_table.c:244
[<ffffffff81483ab0>] task_work_run+0x1a0/0x280 kernel/task_work.c:116
[< inline >] exit_task_work include/linux/task_work.h:21
[<ffffffff8141287a>] do_exit+0x183a/0x2640 kernel/exit.c:828
[<ffffffff8141383e>] do_group_exit+0x14e/0x420 kernel/exit.c:931
[<ffffffff814429d3>] get_signal+0x663/0x1880 kernel/signal.c:2307
[<ffffffff81239b45>] do_signal+0xc5/0x2190 arch/x86/kernel/signal.c:807
[<ffffffff8100666a>] exit_to_usermode_loop+0x1ea/0x2d0
arch/x86/entry/common.c:156
[< inline >] prepare_exit_to_usermode arch/x86/entry/common.c:190
[<ffffffff81009693>] syscall_return_slowpath+0x4d3/0x570
arch/x86/entry/common.c:259
[<ffffffff881478e6>] entry_SYSCALL_64_fastpath+0xc4/0xc6
Link: https://lkml.org/lkml/2017/3/6/252
Signed-off-by: Andrey Ulanov <andreyu@google.com>
Reported-by: Dmitry Vyukov <dvyukov@google.com>
Fixes: 6209344 ("net: unix: fix inflight counting bug in garbage collector")
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
[ Upstream commit 22a0e18eac7a9e986fec76c60fa4a2926d1291e2 ]
I mistakenly added the code to release sk->sk_frag in
sk_common_release() instead of sk_destruct()
TCP sockets using sk->sk_allocation == GFP_ATOMIC do no call
sk_common_release() at close time, thus leaking one (order-3) page.
iSCSI is using such sockets.
Fixes: 5640f76858 ("net: use a per task frag allocator")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
[ Upstream commit 3d20f1f7bd575d147ffa75621fa560eea0aec690 ]
When dealing with ipv6 source tunnel key address attribute
(OVS_TUNNEL_KEY_ATTR_IPV6_SRC) we are wrongly setting the tunnel
dst ip, fix that.
Fixes: 6b26ba3a7d ('openvswitch: netlink attributes for IPv6 tunneling')
Signed-off-by: Or Gerlitz <ogerlitz@mellanox.com>
Reported-by: Paul Blakey <paulb@mellanox.com>
Acked-by: Jiri Benc <jbenc@redhat.com>
Acked-by: Joe Stringer <joe@ovn.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
[ Upstream commit edb9d1bff4bbe19b8ae0e71b1f38732591a9eeb2 ]
When tc actions are loaded as a module and no actions have been installed,
flushing them would result in actions removed from the memory, but modules
reference count not being decremented, so that the modules would not be
unloaded.
Following is example with GACT action:
% sudo modprobe act_gact
% lsmod
Module Size Used by
act_gact 16384 0
%
% sudo tc actions ls action gact
%
% sudo tc actions flush action gact
% lsmod
Module Size Used by
act_gact 16384 1
% sudo tc actions flush action gact
% lsmod
Module Size Used by
act_gact 16384 2
% sudo rmmod act_gact
rmmod: ERROR: Module act_gact is in use
....
After the fix:
% lsmod
Module Size Used by
act_gact 16384 0
%
% sudo tc actions add action pass index 1
% sudo tc actions add action pass index 2
% sudo tc actions add action pass index 3
% lsmod
Module Size Used by
act_gact 16384 3
%
% sudo tc actions flush action gact
% lsmod
Module Size Used by
act_gact 16384 0
%
% sudo tc actions flush action gact
% lsmod
Module Size Used by
act_gact 16384 0
% sudo rmmod act_gact
% lsmod
Module Size Used by
%
Fixes: f97017cdef ("net-sched: Fix actions flushing")
Signed-off-by: Roman Mashak <mrv@mojatatu.com>
Signed-off-by: Jamal Hadi Salim <jhs@mojatatu.com>
Acked-by: Cong Wang <xiyou.wangcong@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
[ Upstream commit 72ef9c4125c7b257e3a714d62d778ab46583d6a3 ]
This patch fixes a memory leak, which happens if the connection request
is not fulfilled between parsing the DCCP options and handling the SYN
(because e.g. the backlog is full), because we forgot to free the
list of ack vectors.
Reported-by: Jianwen Ji <jiji@redhat.com>
Signed-off-by: Hannes Frederic Sowa <hannes@stressinduktion.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
[ Upstream commit 45caeaa5ac0b4b11784ac6f932c0ad4c6b67cda0 ]
As Eric Dumazet pointed out this also needs to be fixed in IPv6.
v2: Contains the IPv6 tcp/Ipv6 dccp patches as well.
We have seen a few incidents lately where a dst_enty has been freed
with a dangling TCP socket reference (sk->sk_dst_cache) pointing to that
dst_entry. If the conditions/timings are right a crash then ensues when the
freed dst_entry is referenced later on. A Common crashing back trace is:
#8 [] page_fault at ffffffff8163e648
[exception RIP: __tcp_ack_snd_check+74]
.
.
#9 [] tcp_rcv_established at ffffffff81580b64
#10 [] tcp_v4_do_rcv at ffffffff8158b54a
#11 [] tcp_v4_rcv at ffffffff8158cd02
#12 [] ip_local_deliver_finish at ffffffff815668f4
#13 [] ip_local_deliver at ffffffff81566bd9
#14 [] ip_rcv_finish at ffffffff8156656d
#15 [] ip_rcv at ffffffff81566f06
#16 [] __netif_receive_skb_core at ffffffff8152b3a2
#17 [] __netif_receive_skb at ffffffff8152b608
#18 [] netif_receive_skb at ffffffff8152b690
#19 [] vmxnet3_rq_rx_complete at ffffffffa015eeaf [vmxnet3]
#20 [] vmxnet3_poll_rx_only at ffffffffa015f32a [vmxnet3]
#21 [] net_rx_action at ffffffff8152bac2
#22 [] __do_softirq at ffffffff81084b4f
#23 [] call_softirq at ffffffff8164845c
#24 [] do_softirq at ffffffff81016fc5
#25 [] irq_exit at ffffffff81084ee5
#26 [] do_IRQ at ffffffff81648ff8
Of course it may happen with other NIC drivers as well.
It's found the freed dst_entry here:
224 static bool tcp_in_quickack_mode(struct sock *sk)↩
225 {↩
226 ▹ const struct inet_connection_sock *icsk = inet_csk(sk);↩
227 ▹ const struct dst_entry *dst = __sk_dst_get(sk);↩
228 ↩
229 ▹ return (dst && dst_metric(dst, RTAX_QUICKACK)) ||↩
230 ▹ ▹ (icsk->icsk_ack.quick && !icsk->icsk_ack.pingpong);↩
231 }↩
But there are other backtraces attributed to the same freed dst_entry in
netfilter code as well.
All the vmcores showed 2 significant clues:
- Remote hosts behind the default gateway had always been redirected to a
different gateway. A rtable/dst_entry will be added for that host. Making
more dst_entrys with lower reference counts. Making this more probable.
- All vmcores showed a postitive LockDroppedIcmps value, e.g:
LockDroppedIcmps 267
A closer look at the tcp_v4_err() handler revealed that do_redirect() will run
regardless of whether user space has the socket locked. This can result in a
race condition where the same dst_entry cached in sk->sk_dst_entry can be
decremented twice for the same socket via:
do_redirect()->__sk_dst_check()-> dst_release().
Which leads to the dst_entry being prematurely freed with another socket
pointing to it via sk->sk_dst_cache and a subsequent crash.
To fix this skip do_redirect() if usespace has the socket locked. Instead let
the redirect take place later when user space does not have the socket
locked.
The dccp/IPv6 code is very similar in this respect, so fixing it there too.
As Eric Garver pointed out the following commit now invalidates routes. Which
can set the dst->obsolete flag so that ipv4_dst_check() returns null and
triggers the dst_release().
Fixes: ceb3320610 ("ipv4: Kill routes during PMTU/redirect updates.")
Cc: Eric Garver <egarver@redhat.com>
Cc: Hannes Sowa <hsowa@redhat.com>
Signed-off-by: Jon Maxwell <jmaxwell37@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
[ Upstream commit a13b2082ece95247779b9995c4e91b4246bed023 ]
Andreas reports kernel oops during rmmod of the br_netfilter module.
Hannes debugged the oops down to a NULL rt6info->rt6i_indev.
Problem is that br_netfilter has the nasty concept of adding a fake
rtable to skb->dst; this happens in a br_netfilter prerouting hook.
A second hook (in bridge LOCAL_IN) is supposed to remove these again
before the skb is handed up the stack.
However, on module unload hooks get unregistered which means an
skb could traverse the prerouting hook that attaches the fake_rtable,
while the 'fake rtable remove' hook gets removed from the hooklist
immediately after.
Fixes: 34666d467c ("netfilter: bridge: move br_netfilter out of the core")
Reported-by: Andreas Karis <akaris@redhat.com>
Debugged-by: Hannes Frederic Sowa <hannes@stressinduktion.org>
Signed-off-by: Florian Westphal <fw@strlen.de>
Acked-by: Pablo Neira Ayuso <pablo@netfilter.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>