android_kernel_oneplus_msm8998/net
Daniel Borkmann cfa7c16d45 net, sched: respect rcu grace period on cls destruction
[ Upstream commit d936377414fadbafb4d17148d222fe45ca5442d4 ]

Roi reported a crash in flower where tp->root was NULL in ->classify()
callbacks. Reason is that in ->destroy() tp->root is set to NULL via
RCU_INIT_POINTER(). It's problematic for some of the classifiers, because
this doesn't respect RCU grace period for them, and as a result, still
outstanding readers from tc_classify() will try to blindly dereference
a NULL tp->root.

The tp->root object is strictly private to the classifier implementation
and holds internal data the core such as tc_ctl_tfilter() doesn't know
about. Within some classifiers, such as cls_bpf, cls_basic, etc, tp->root
is only checked for NULL in ->get() callback, but nowhere else. This is
misleading and seemed to be copied from old classifier code that was not
cleaned up properly. For example, d3fa76ee6b ("[NET_SCHED]: cls_basic:
fix NULL pointer dereference") moved tp->root initialization into ->init()
routine, where before it was part of ->change(), so ->get() had to deal
with tp->root being NULL back then, so that was indeed a valid case, after
d3fa76ee6b, not really anymore. We used to set tp->root to NULL long
ago in ->destroy(), see 47a1a1d4be ("pkt_sched: remove unnecessary xchg()
in packet classifiers"); but the NULLifying was reintroduced with the
RCUification, but it's not correct for every classifier implementation.

In the cases that are fixed here with one exception of cls_cgroup, tp->root
object is allocated and initialized inside ->init() callback, which is always
performed at a point in time after we allocate a new tp, which means tp and
thus tp->root was not globally visible in the tp chain yet (see tc_ctl_tfilter()).
Also, on destruction tp->root is strictly kfree_rcu()'ed in ->destroy()
handler, same for the tp which is kfree_rcu()'ed right when we return
from ->destroy() in tcf_destroy(). This means, the head object's lifetime
for such classifiers is always tied to the tp lifetime. The RCU callback
invocation for the two kfree_rcu() could be out of order, but that's fine
since both are independent.

Dropping the RCU_INIT_POINTER(tp->root, NULL) for these classifiers here
means that 1) we don't need a useless NULL check in fast-path and, 2) that
outstanding readers of that tp in tc_classify() can still execute under
respect with RCU grace period as it is actually expected.

Things that haven't been touched here: cls_fw and cls_route. They each
handle tp->root being NULL in ->classify() path for historic reasons, so
their ->destroy() implementation can stay as is. If someone actually
cares, they could get cleaned up at some point to avoid the test in fast
path. cls_u32 doesn't set tp->root to NULL. For cls_rsvp, I just added a
!head should anyone actually be using/testing it, so it at least aligns with
cls_fw and cls_route. For cls_flower we additionally need to defer rhashtable
destruction (to a sleepable context) after RCU grace period as concurrent
readers might still access it. (Note that in this case we need to hold module
reference to keep work callback address intact, since we only wait on module
unload for all call_rcu()s to finish.)

This fixes one race to bring RCU grace period guarantees back. Next step
as worked on by Cong however is to fix 1e052be69d ("net_sched: destroy
proto tp when all filters are gone") to get the order of unlinking the tp
in tc_ctl_tfilter() for the RTM_DELTFILTER case right by moving
RCU_INIT_POINTER() before tcf_destroy() and let the notification for
removal be done through the prior ->delete() callback. Both are independant
issues. Once we have that right, we can then clean tp->root up for a number
of classifiers by not making them RCU pointers, which requires a new callback
(->uninit) that is triggered from tp's RCU callback, where we just kfree()
tp->root from there.

Fixes: 1f947bf151 ("net: sched: rcu'ify cls_bpf")
Fixes: 9888faefe1 ("net: sched: cls_basic use RCU")
Fixes: 70da9f0bf9 ("net: sched: cls_flow use RCU")
Fixes: 77b9900ef5 ("tc: introduce Flower classifier")
Fixes: bf3994d2ed31 ("net/sched: introduce Match-all classifier")
Fixes: 952313bd62 ("net: sched: cls_cgroup use RCU")
Reported-by: Roi Dayan <roid@mellanox.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Cc: Cong Wang <xiyou.wangcong@gmail.com>
Cc: John Fastabend <john.fastabend@gmail.com>
Cc: Roi Dayan <roid@mellanox.com>
Cc: Jiri Pirko <jiri@mellanox.com>
Acked-by: John Fastabend <john.r.fastabend@intel.com>
Acked-by: Cong Wang <xiyou.wangcong@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2016-12-10 19:07:23 +01:00
..
6lowpan 6lowpan: put mcast compression in an own function 2015-10-21 00:49:25 +02:00
9p IB/cma: Add support for network namespaces 2015-10-28 12:32:48 -04:00
802
8021q net: add recursion limit to GRO 2016-11-15 07:46:38 +01:00
appletalk
atm atm: deal with setting entry before mkip was called 2015-09-17 22:13:32 -07:00
ax25 AX.25: Close socket connection on session completion 2016-07-11 09:31:12 -07:00
batman-adv batman-adv: remove unused callback from batadv_algo_ops struct 2016-10-07 15:23:47 +02:00
bluetooth Bluetooth: Fix l2cap_sock_setsockopt() with optname BT_RCVMTU 2016-08-20 18:09:19 +02:00
bridge bridge: multicast: restore perm router ports on multicast enable 2016-11-15 07:46:38 +01:00
caif net: caif: fix misleading indentation 2016-09-30 10:18:35 +02:00
can can: bcm: fix warning in bcm_connect/proc_register 2016-11-26 09:54:52 +01:00
ceph libceph: apply new_state before new_up_client on incrementals 2016-08-10 11:49:29 +02:00
core rtnetlink: fix FDB size computation 2016-12-10 19:07:22 +01:00
dcb net/dcb: make dcbnl.c explicitly non-modular 2015-10-09 07:52:27 -07:00
dccp ipv6: dccp: add missing bind_conflict to dccp_ipv6_mapped 2016-11-21 10:06:40 +01:00
decnet decnet: Do not build routes to devices without decnet private data. 2016-05-18 17:06:35 -07:00
dns_resolver net: dns_resolver: convert time_t to time64_t 2015-11-18 16:27:46 -05:00
dsa net: dsa: use switchdev obj for VLAN add/del ops 2015-11-01 15:56:11 -05:00
ethernet net: add recursion limit to GRO 2016-11-15 07:46:38 +01:00
hsr net/hsr: fix a warning message 2015-11-23 14:56:15 -05:00
ieee802154 net: fix percpu memory leaks 2015-11-02 22:47:14 -05:00
ipv4 tcp: take care of truncations done by sk_filter() 2016-11-21 10:06:40 +01:00
ipv6 ip6_tunnel: disable caching when the traffic class is inherited 2016-12-10 19:07:22 +01:00
ipx
irda net/irda: handle iriap_register_lsap() allocation failure 2016-09-30 10:18:36 +02:00
iucv af_iucv: Validate socket address length in iucv_sock_bind() 2016-03-03 15:07:03 -08:00
key af_key: fix two typos 2015-10-23 03:05:19 -07:00
l2tp l2tp: fix racy SOCK_ZAPPED flag check in l2tp_ip{,6}_bind() 2016-12-10 19:07:23 +01:00
l3mdev net: Add netif_is_l3_slave 2015-10-07 04:27:43 -07:00
lapb
llc net: fix infoleak in llc 2016-05-18 17:06:40 -07:00
mac80211 mac80211: discard multicast and 4-addr A-MSDUs 2016-11-10 16:36:35 +01:00
mac802154 mac802154: llsec: use kzfree 2015-10-21 00:49:24 +02:00
mpls mpls: find_outdev: check for err ptr in addition to NULL check 2016-04-20 15:42:07 +09:00
netfilter netfilter: nft_dynset: fix element timeout for HZ != 1000 2016-11-26 09:54:54 +01:00
netlabel netlabel: add address family checks to netlbl_{sock,req}_delattr() 2016-08-20 18:09:22 +02:00
netlink netlink: do not enter direct reclaim from netlink_dump() 2016-11-15 07:46:37 +01:00
netrom
nfc net: rename SOCK_ASYNC_NOSPACE and SOCK_ASYNC_WAITDATA 2015-12-01 15:45:05 -05:00
openvswitch vxlan, gre, geneve: Set a large MTU on ovs-created tunnel devices 2016-06-24 10:18:18 -07:00
packet packet: on direct_xmit, limit tso and csum to supported devices 2016-11-15 07:46:39 +01:00
phonet phonet: properly unshare skbs in phonet_rcv() 2016-01-31 11:29:00 -08:00
rds rds: fix an infoleak in rds_inc_info_copy 2016-09-15 08:27:51 +02:00
rfkill rfkill: fix rfkill_fop_read wait_event usage 2016-03-03 15:07:26 -08:00
rose
rxrpc net: rename SOCK_ASYNC_NOSPACE and SOCK_ASYNC_WAITDATA 2015-12-01 15:45:05 -05:00
sched net, sched: respect rcu grace period on cls destruction 2016-12-10 19:07:23 +01:00
sctp sctp: assign assoc_id earlier in __sctp_connect 2016-11-21 10:06:40 +01:00
sunrpc sunrpc: fix write space race causing stalls 2016-10-28 03:01:31 -04:00
switchdev switchdev: pass pointer to fib_info instead of copy 2016-06-24 10:18:16 -07:00
tipc tipc: fix NULL pointer dereference in shutdown() 2016-09-30 10:18:36 +02:00
unix af_unix: conditionally use freezable blocking calls in read 2016-12-10 19:07:22 +01:00
vmw_vsock VSOCK: do not disconnect socket when peer has shutdown SEND only 2016-05-18 17:06:41 -07:00
wimax net:wimax: Fix doucble word "the the" in networking.xml 2015-08-09 22:43:52 -07:00
wireless cfg80211: limit scan results cache size 2016-12-02 09:09:01 +01:00
x25 net: fix a kernel infoleak in x25 module 2016-05-18 17:06:43 -07:00
xfrm xfrm: Fix crash observed during device unregistration and decryption 2016-04-20 15:42:05 +09:00
compat.c
Kconfig net: Introduce L3 Master device abstraction 2015-09-29 20:40:32 -07:00
Makefile net: Introduce L3 Master device abstraction 2015-09-29 20:40:32 -07:00
socket.c sock: fix sendmmsg for partial sendmsg 2016-11-21 10:06:40 +01:00
sysctl_net.c net: Use ns_capable_noaudit() when determining net sysctl permissions 2016-09-15 08:27:50 +02:00