commit 4576cd469d980317c4edd9173f8b694aa71ea3a3 upstream.
TPACKET_V3 stores variable length frames in fixed length blocks.
Blocks must be able to store a block header, optional private space
and at least one minimum sized frame.
Frames, even for a zero snaplen packet, store metadata headers and
optional reserved space.
In the block size bounds check, ensure that the frame of the
chosen configuration fits. This includes sockaddr_ll and optional
tp_reserve.
Syzbot was able to construct a ring with insuffient room for the
sockaddr_ll in the header of a zero-length frame, triggering an
out-of-bounds write in dev_parse_header.
Convert the comparison to less than, as zero is a valid snap len.
This matches the test for minimum tp_frame_size immediately below.
Fixes: f6fb8f100b ("af-packet: TPACKET_V3 flexible buffer implementation.")
Fixes: eb73190f4fbe ("net/packet: refine check for priv area size")
Reported-by: syzbot <syzkaller@googlegroups.com>
Signed-off-by: Willem de Bruijn <willemb@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
[ Upstream commit 993675a3100b16a4c80dfd70cbcde8ea7127b31d ]
If variable length link layer headers result in a packet shorter
than dev->hard_header_len, reset the network header offset. Else
skb->mac_len may exceed skb->len after skb_mac_reset_len.
packet_sendmsg_spkt already has similar logic.
Fixes: b84bbaf7a6c8 ("packet: in packet_snd start writing at link layer allocation")
Signed-off-by: Willem de Bruijn <willemb@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Sasha Levin <alexander.levin@microsoft.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
[ Upstream commit 9aad13b087ab0a588cd68259de618f100053360e ]
Commit b84bbaf7a6c8 ("packet: in packet_snd start writing at link
layer allocation") ensures that packet_snd always starts writing
the link layer header in reserved headroom allocated for this
purpose.
This is needed because packets may be shorter than hard_header_len,
in which case the space up to hard_header_len may be zeroed. But
that necessary padding is not accounted for in skb->len.
The fix, however, is buggy. It calls skb_push, which grows skb->len
when moving skb->data back. But in this case packet length should not
change.
Instead, call skb_reserve, which moves both skb->data and skb->tail
back, without changing length.
Fixes: b84bbaf7a6c8 ("packet: in packet_snd start writing at link layer allocation")
Reported-by: Tariq Toukan <tariqt@mellanox.com>
Signed-off-by: Willem de Bruijn <willemb@google.com>
Acked-by: Soheil Hassas Yeganeh <soheil@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
[ Upstream commit b84bbaf7a6c8cca24f8acf25a2c8e46913a947ba ]
Packet sockets allow construction of packets shorter than
dev->hard_header_len to accommodate protocols with variable length
link layer headers. These packets are padded to dev->hard_header_len,
because some device drivers interpret that as a minimum packet size.
packet_snd reserves dev->hard_header_len bytes on allocation.
SOCK_DGRAM sockets call skb_push in dev_hard_header() to ensure that
link layer headers are stored in the reserved range. SOCK_RAW sockets
do the same in tpacket_snd, but not in packet_snd.
Syzbot was able to send a zero byte packet to a device with massive
116B link layer header, causing padding to cross over into skb_shinfo.
Fix this by writing from the start of the llheader reserved range also
in the case of packet_snd/SOCK_RAW.
Update skb_set_network_header to the new offset. This also corrects
it for SOCK_DGRAM, where it incorrectly double counted reserve due to
the skb_push in dev_hard_header.
Fixes: 9ed988cd5915 ("packet: validate variable length ll headers")
Reported-by: syzbot+71d74a5406d02057d559@syzkaller.appspotmail.com
Signed-off-by: Willem de Bruijn <willemb@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
[ Upstream commit 5171b37d959641bbc619781caf62e61f7b940871 ]
In order to remove the race caught by syzbot [1], we need
to lock the socket before using po->tp_version as this could
change under us otherwise.
This means lock_sock() and release_sock() must be done by
packet_set_ring() callers.
[1] :
BUG: KMSAN: uninit-value in packet_set_ring+0x1254/0x3870 net/packet/af_packet.c:4249
CPU: 0 PID: 20195 Comm: syzkaller707632 Not tainted 4.16.0+ #83
Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011
Call Trace:
__dump_stack lib/dump_stack.c:17 [inline]
dump_stack+0x185/0x1d0 lib/dump_stack.c:53
kmsan_report+0x142/0x240 mm/kmsan/kmsan.c:1067
__msan_warning_32+0x6c/0xb0 mm/kmsan/kmsan_instr.c:676
packet_set_ring+0x1254/0x3870 net/packet/af_packet.c:4249
packet_setsockopt+0x12c6/0x5a90 net/packet/af_packet.c:3662
SYSC_setsockopt+0x4b8/0x570 net/socket.c:1849
SyS_setsockopt+0x76/0xa0 net/socket.c:1828
do_syscall_64+0x309/0x430 arch/x86/entry/common.c:287
entry_SYSCALL_64_after_hwframe+0x3d/0xa2
RIP: 0033:0x449099
RSP: 002b:00007f42b5307ce8 EFLAGS: 00000246 ORIG_RAX: 0000000000000036
RAX: ffffffffffffffda RBX: 000000000070003c RCX: 0000000000449099
RDX: 0000000000000005 RSI: 0000000000000107 RDI: 0000000000000003
RBP: 0000000000700038 R08: 000000000000001c R09: 0000000000000000
R10: 00000000200000c0 R11: 0000000000000246 R12: 0000000000000000
R13: 000000000080eecf R14: 00007f42b53089c0 R15: 0000000000000001
Local variable description: ----req_u@packet_setsockopt
Variable was created at:
packet_setsockopt+0x13f/0x5a90 net/packet/af_packet.c:3612
SYSC_setsockopt+0x4b8/0x570 net/socket.c:1849
Fixes: f6fb8f100b ("af-packet: TPACKET_V3 flexible buffer implementation.")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Reported-by: syzbot <syzkaller@googlegroups.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
[ Upstream commit a6361f0ca4b25460f2cdf3235ebe8115f622901e ]
Updates to the bitfields in struct packet_sock are not atomic.
Serialize these read-modify-write cycles.
Move po->running into a separate variable. Its writes are protected by
po->bind_lock (except for one startup case at packet_create). Also
replace a textual precondition warning with lockdep annotation.
All others are set only in packet_setsockopt. Serialize these
updates by holding the socket lock. Analogous to other field updates,
also hold the lock when testing whether a ring is active (pg_vec).
Fixes: 8dc4194474 ("[PACKET]: Add optional checksum computation for recvmsg")
Reported-by: DaeRyong Jeong <threeearcat@gmail.com>
Reported-by: Byoungyoung Lee <byoungyoung@purdue.edu>
Signed-off-by: Willem de Bruijn <willemb@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
syzkaller found a race condition fanout_demux_rollover() while removing
a packet socket from a fanout group.
po->rollover is read and operated on during packet_rcv_fanout(), via
fanout_demux_rollover(), but the pointer is currently cleared before the
synchronization in packet_release(). It is safer to delay the cleanup
until after synchronize_net() has been called, ensuring all calls to
packet_rcv_fanout() for this socket have finished.
To further simplify synchronization around the rollover structure, set
po->rollover in fanout_add() only if there are no errors. This removes
the need for rcu in the struct and in the call to
packet_getsockopt(..., PACKET_ROLLOVER_STATS, ...).
Crashing stack trace:
fanout_demux_rollover+0xb6/0x4d0 net/packet/af_packet.c:1392
packet_rcv_fanout+0x649/0x7c8 net/packet/af_packet.c:1487
dev_queue_xmit_nit+0x835/0xc10 net/core/dev.c:1953
xmit_one net/core/dev.c:2975 [inline]
dev_hard_start_xmit+0x16b/0xac0 net/core/dev.c:2995
__dev_queue_xmit+0x17a4/0x2050 net/core/dev.c:3476
dev_queue_xmit+0x17/0x20 net/core/dev.c:3509
neigh_connected_output+0x489/0x720 net/core/neighbour.c:1379
neigh_output include/net/neighbour.h:482 [inline]
ip6_finish_output2+0xad1/0x22a0 net/ipv6/ip6_output.c:120
ip6_finish_output+0x2f9/0x920 net/ipv6/ip6_output.c:146
NF_HOOK_COND include/linux/netfilter.h:239 [inline]
ip6_output+0x1f4/0x850 net/ipv6/ip6_output.c:163
dst_output include/net/dst.h:459 [inline]
NF_HOOK.constprop.35+0xff/0x630 include/linux/netfilter.h:250
mld_sendpack+0x6a8/0xcc0 net/ipv6/mcast.c:1660
mld_send_initial_cr.part.24+0x103/0x150 net/ipv6/mcast.c:2072
mld_send_initial_cr net/ipv6/mcast.c:2056 [inline]
ipv6_mc_dad_complete+0x99/0x130 net/ipv6/mcast.c:2079
addrconf_dad_completed+0x595/0x970 net/ipv6/addrconf.c:4039
addrconf_dad_work+0xac9/0x1160 net/ipv6/addrconf.c:3971
process_one_work+0xbf0/0x1bc0 kernel/workqueue.c:2113
worker_thread+0x223/0x1990 kernel/workqueue.c:2247
kthread+0x35e/0x430 kernel/kthread.c:231
ret_from_fork+0x2a/0x40 arch/x86/entry/entry_64.S:432
Fixes: 0648ab70af ("packet: rollover prepare: per-socket state")
Fixes: 509c7a1ecc860 ("packet: avoid panic in packet_getsockopt()")
Reported-by: syzbot <syzkaller@googlegroups.com>
Signed-off-by: Mike Maloney <maloney@google.com>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
[ Upstream commit 509c7a1ecc8601f94ffba8a00889fefb239c00c6 ]
syzkaller got crashes in packet_getsockopt() processing
PACKET_ROLLOVER_STATS command while another thread was managing
to change po->rollover
Using RCU will fix this bug. We might later add proper RCU annotations
for sparse sake.
In v2: I replaced kfree(rollover) in fanout_add() to kfree_rcu()
variant, as spotted by John.
Fixes: a9b6391814 ("packet: rollover statistics")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Willem de Bruijn <willemb@google.com>
Cc: John Sperbeck <jsperbeck@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
[ Upstream commit da7c9561015e93d10fe6aab73e9288e0d09d65a6 ]
Packet socket option po->has_vnet_hdr can be updated concurrently with
other operations if no ring is attached.
Do not test the option twice in packet_snd, as the value may change in
between calls. A race on setsockopt disable may cause a packet > mtu
to be sent without having GSO options set.
Fixes: bfd5f4a3d6 ("packet: Add GSO/csum offload support.")
Signed-off-by: Willem de Bruijn <willemb@google.com>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
[ Upstream commit 4971613c1639d8e5f102c4e797c3bf8f83a5a69e ]
Once a socket has po->fanout set, it remains a member of the group
until it is destroyed. The prot_hook must be constant and identical
across sockets in the group.
If fanout_add races with packet_do_bind between the test of po->fanout
and taking the lock, the bind call may make type or dev inconsistent
with that of the fanout group.
Hold po->bind_lock when testing po->fanout to avoid this race.
I had to introduce artificial delay (local_bh_enable) to actually
observe the race.
Fixes: dc99f60069 ("packet: Add fanout support.")
Signed-off-by: Willem de Bruijn <willemb@google.com>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
[ Upstream commit 008ba2a13f2d04c947adc536d19debb8fe66f110 ]
Packet socket bind operations must hold the po->bind_lock. This keeps
po->running consistent with whether the socket is actually on a ptype
list to receive packets.
fanout_add unbinds a socket and its packet_rcv/tpacket_rcv call, then
binds the fanout object to receive through packet_rcv_fanout.
Make it hold the po->bind_lock when testing po->running and rebinding.
Else, it can race with other rebind operations, such as that in
packet_set_ring from packet_rcv to tpacket_rcv. Concurrent updates
can result in a socket being added to a fanout group twice, causing
use-after-free KASAN bug reports, among others.
Reported independently by both trinity and syzkaller.
Verified that the syzkaller reproducer passes after this patch.
Fixes: dc99f60069 ("packet: Add fanout support.")
Reported-by: nixioaming <nixiaoming@huawei.com>
Signed-off-by: Willem de Bruijn <willemb@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
[ Upstream commit fd2c83b35752f0a8236b976978ad4658df14a59f ]
In the case getsockopt() is called with PACKET_HDRLEN and optlen < 4
|val| remains uninitialized and the syscall may behave differently
depending on its value, and even copy garbage to userspace on certain
architectures. To fix this we now return -EINVAL if optlen is too small.
This bug has been detected with KMSAN.
Signed-off-by: Alexander Potapenko <glider@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Sasha Levin <alexander.levin@verizon.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
[ Upstream commit c27927e372f0785f3303e8fad94b85945e2c97b7 ]
Updates to tp_reserve can race with reads of the field in
packet_set_ring. Avoid this by holding the socket lock during
updates in setsockopt PACKET_RESERVE.
This bug was discovered by syzkaller.
Fixes: 8913336a7e ("packet: add PACKET_RESERVE sockopt")
Reported-by: Andrey Konovalov <andreyknvl@google.com>
Signed-off-by: Willem de Bruijn <willemb@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
[ Upstream commit c800aaf8d869f2b9b47b10c5c312fe19f0a94042 ]
There are multiple reports showing we have a use-after-free in
the timer prb_retire_rx_blk_timer_expired(), where we use struct
tpacket_kbdq_core::pkbdq, a pg_vec, after it gets freed by
free_pg_vec().
The interesting part is it is not freed via packet_release() but
via packet_setsockopt(), which means we are not closing the socket.
Looking into the big and fat function packet_set_ring(), this could
happen if we satisfy the following conditions:
1. closing == 0, not on packet_release() path
2. req->tp_block_nr == 0, we don't allocate a new pg_vec
3. rx_ring->pg_vec is already set as V3, which means we already called
packet_set_ring() wtih req->tp_block_nr > 0 previously
4. req->tp_frame_nr == 0, pass sanity check
5. po->mapped == 0, never called mmap()
In this scenario we are clearing the old rx_ring->pg_vec, so we need
to free this pg_vec, but we don't stop the timer on this path because
of closing==0.
The timer has to be stopped as long as we need to free pg_vec, therefore
the check on closing!=0 is wrong, we should check pg_vec!=NULL instead.
Thanks to liujian for testing different fixes.
Reported-by: alexander.levin@verizon.com
Reported-by: Dave Jones <davej@codemonkey.org.uk>
Reported-by: liujian (CE) <liujian56@huawei.com>
Tested-by: liujian (CE) <liujian56@huawei.com>
Cc: Ding Tianhong <dingtianhong@huawei.com>
Cc: Willem de Bruijn <willemdebruijn.kernel@gmail.com>
Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
[ Upstream commit bcc5364bdcfe131e6379363f089e7b4108d35b70 ]
When calculating po->tp_hdrlen + po->tp_reserve the result can overflow.
Fix by checking that tp_reserve <= INT_MAX on assign.
Signed-off-by: Andrey Konovalov <andreyknvl@google.com>
Acked-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
[ Upstream commit 8f8d28e4d6d815a391285e121c3a53a0b6cb9e7b ]
When calculating rb->frames_per_block * req->tp_block_nr the result
can overflow.
Add a check that tp_block_size * tp_block_nr <= UINT_MAX.
Since frames_per_block <= tp_block_size, the expression would
never overflow.
Signed-off-by: Andrey Konovalov <andreyknvl@google.com>
Acked-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
commit 2b6867c2ce76c596676bec7d2d525af525fdc6e2 upstream.
Subtracting tp_sizeof_priv from tp_block_size and casting to int
to check whether one is less then the other doesn't always work
(both of them are unsigned ints).
Compare them as is instead.
Also cast tp_sizeof_priv to u64 before using BLK_PLUS_PRIV, as
it can overflow inside BLK_PLUS_PRIV otherwise.
Signed-off-by: Andrey Konovalov <andreyknvl@google.com>
Acked-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
[ Upstream commit d199fab63c11998a602205f7ee7ff7c05c97164b ]
Multiple threads can call fanout_add() at the same time.
We need to grab fanout_mutex earlier to avoid races that could
lead to one thread freeing po->rollover that was set by another thread.
Do the same in fanout_release(), for peace of mind, and to help us
finding lockdep issues earlier.
Fixes: dc99f60069 ("packet: Add fanout support.")
Fixes: 0648ab70af ("packet: rollover prepare: per-socket state")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Willem de Bruijn <willemb@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
[ Upstream commit 57031eb794906eea4e1c7b31dc1e2429c0af0c66 ]
Link layer protocols may unconditionally pull headers, as Ethernet
does in eth_type_trans. Ensure that the entire link layer header
always lies in the skb linear segment. tpacket_snd has such a check.
Extend this to packet_snd.
Variable length link layer headers complicate the computation
somewhat. Here skb->len may be smaller than dev->hard_header_len.
Round up the linear length to be at least as long as the smallest of
the two.
Reported-by: Dmitry Vyukov <dvyukov@google.com>
Signed-off-by: Willem de Bruijn <willemb@google.com>
Acked-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
[ Upstream commit 84ac7260236a49c79eede91617700174c2c19b0c ]
When packet_set_ring creates a ring buffer it will initialize a
struct timer_list if the packet version is TPACKET_V3. This value
can then be raced by a different thread calling setsockopt to
set the version to TPACKET_V1 before packet_set_ring has finished.
This leads to a use-after-free on a function pointer in the
struct timer_list when the socket is closed as the previously
initialized timer will not be deleted.
The bug is fixed by taking lock_sock(sk) in packet_setsockopt when
changing the packet version while also taking the lock at the start
of packet_set_ring.
Fixes: f6fb8f100b ("af-packet: TPACKET_V3 flexible buffer implementation.")
Signed-off-by: Philip Pettersson <philip.pettersson@gmail.com>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
[ Upstream commit 104ba78c98808ae837d1f63aae58c183db5505df ]
When transmitting on a packet socket with PACKET_VNET_HDR and
PACKET_QDISC_BYPASS, validate device support for features requested
in vnet_hdr.
Drop TSO packets sent to devices that do not support TSO or have the
feature disabled. Note that the latter currently do process those
packets correctly, regardless of not advertising the feature.
Because of SKB_GSO_DODGY, it is not sufficient to test device features
with netif_needs_gso. Full validate_xmit_skb is needed.
Switch to software checksum for non-TSO packets that request checksum
offload if that device feature is unsupported or disabled. Note that
similar to the TSO case, device drivers may perform checksum offload
correctly even when not advertising it.
When switching to software checksum, packets hit skb_checksum_help,
which has two BUG_ON checksum not in linear segment. Packet sockets
always allocate at least up to csum_start + csum_off + 2 as linear.
Tested by running github.com/wdebruij/kerneltools/psock_txring_vnet.c
ethtool -K eth0 tso off tx on
psock_txring_vnet -d $dst -s $src -i eth0 -l 2000 -n 1 -q -v
psock_txring_vnet -d $dst -s $src -i eth0 -l 2000 -n 1 -q -v -N
ethtool -K eth0 tx off
psock_txring_vnet -d $dst -s $src -i eth0 -l 1000 -n 1 -q -v -G
psock_txring_vnet -d $dst -s $src -i eth0 -l 1000 -n 1 -q -v -G -N
v2:
- add EXPORT_SYMBOL_GPL(validate_xmit_skb_list)
Fixes: d346a3fae3 ("packet: introduce PACKET_QDISC_BYPASS socket option")
Signed-off-by: Willem de Bruijn <willemb@google.com>
Acked-by: Eric Dumazet <edumazet@google.com>
Acked-by: Daniel Borkmann <daniel@iogearbox.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
[ Upstream commit 6664498280cf17a59c3e7cf1a931444c02633ed1 ]
If a socket has FANOUT sockopt set, a new proto_hook is registered
as part of fanout_add(). When processing a NETDEV_UNREGISTER event in
af_packet, __fanout_unlink is called for all sockets, but prot_hook which was
registered as part of fanout_add is not removed. Call fanout_release, on a
NETDEV_UNREGISTER, which removes prot_hook and removes fanout from the
fanout_list.
This fixes BUG_ON(!list_empty(&dev->ptype_specific)) in netdev_run_todo()
Signed-off-by: Anoob Soman <anoob.soman@citrix.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
[ Upstream commit eb70db8756717b90c01ccc765fdefc4dd969fc74 ]
People who use PACKET_FANOUT_HASH want a symmetric hash, meaning that
they want packets going in both directions on a flow to hash to the
same bucket.
The core kernel SKB hash became non-symmetric when the ipv6 flow label
and other entities were incorporated into the standard flow hash order
to increase entropy.
But there are no users of PACKET_FANOUT_HASH who want an assymetric
hash, they all want a symmetric one.
Therefore, use the flow dissector to compute a flat symmetric hash
over only the protocol, addresses and ports. This hash does not get
installed into and override the normal skb hash, so this change has
no effect whatsoever on the rest of the stack.
Reported-by: Eric Leblond <eric@regit.org>
Tested-by: Eric Leblond <eric@regit.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
[ Upstream commit 309cf37fe2a781279b7675d4bb7173198e532867 ]
Because we miss to wipe the remainder of i->addr[] in packet_mc_add(),
pdiag_put_mclist() leaks uninitialized heap bytes via the
PACKET_DIAG_MCLIST netlink attribute.
Fix this by explicitly memset(0)ing the remaining bytes in i->addr[].
Fixes: eea68e2f1a ("packet: Report socket mclist info via diag module")
Signed-off-by: Mathias Krause <minipli@googlemail.com>
Cc: Eric W. Biederman <ebiederm@xmission.com>
Cc: Pavel Emelyanov <xemul@parallels.com>
Acked-by: Pavel Emelyanov <xemul@virtuozzo.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
[ Upstream commit 9ed988cd591500c040b2a6257bc68543e08ceeef ]
Replace link layer header validation check ll_header_truncate with
more generic dev_validate_header.
Validation based on hard_header_len incorrectly drops valid packets
in variable length protocols, such as AX25. dev_validate_header
calls header_ops.validate for such protocols to ensure correctness
below hard_header_len.
See also http://comments.gmane.org/gmane.linux.network/401064
Fixes 9c7077622d ("packet: make packet_snd fail on len smaller than l2 header")
Signed-off-by: Willem de Bruijn <willemb@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Commit 9c7077622d ("packet: make packet_snd fail on len smaller
than l2 header") added validation for the packet size in packet_snd.
This change enforces that every packet needs a header (with at least
hard_header_len bytes) plus a payload with at least one byte. Before
this change the payload was optional.
This fixes PPPoE connections which do not have a "Service" or
"Host-Uniq" configured (which is violating the spec, but is still
widely used in real-world setups). Those are currently failing with the
following message: "pppd: packet size is too short (24 <= 24)"
Signed-off-by: Martin Blumenstingl <martin.blumenstingl@googlemail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Use PAGE_ALIGNED(...) instead of open-coding it.
Signed-off-by: Tobias Klauser <tklauser@distanz.ch>
Signed-off-by: David S. Miller <davem@davemloft.net>
rb->frames_per_block is an unsigned int, thus can never be negative.
Also fix spacing in the calculation of frames_per_block.
Signed-off-by: Tobias Klauser <tklauser@distanz.ch>
Signed-off-by: David S. Miller <davem@davemloft.net>
Since it's introduction in commit 69e3c75f4d ("net: TX_RING and
packet mmap"), TX_RING could be used from SOCK_DGRAM and SOCK_RAW
side. When used with SOCK_DGRAM only, the size_max > dev->mtu +
reserve check should have reserve as 0, but currently, this is
unconditionally set (in it's original form as dev->hard_header_len).
I think this is not correct since tpacket_fill_skb() would then
take dev->mtu and dev->hard_header_len into account for SOCK_DGRAM,
the extra VLAN_HLEN could be possible in both cases. Presumably, the
reserve code was copied from packet_snd(), but later on missed the
check. Make it similar as we have it in packet_snd().
Fixes: 69e3c75f4d ("net: TX_RING and packet mmap")
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Willem de Bruijn <willemb@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
In case no struct sockaddr_ll has been passed to packet
socket's sendmsg() when doing a TX_RING flush run, then
skb->protocol is set to po->num instead, which is the protocol
passed via socket(2)/bind(2).
Applications only xmitting can go the path of allocating the
socket as socket(PF_PACKET, <mode>, 0) and do a bind(2) on the
TX_RING with sll_protocol of 0. That way, register_prot_hook()
is neither called on creation nor on bind time, which saves
cycles when there's no interest in capturing anyway.
That leaves us however with po->num 0 instead and therefore
the TX_RING flush run sets skb->protocol to 0 as well. Eric
reported that this leads to problems when using tools like
trafgen over bonding device. I.e. the bonding's hash function
could invoke the kernel's flow dissector, which depends on
skb->protocol being properly set. In the current situation, all
the traffic is then directed to a single slave.
Fix it up by inferring skb->protocol from the Ethernet header
when not set and we have ARPHRD_ETHER device type. This is only
done in case of SOCK_RAW and where we have a dev->hard_header_len
length. In case of ARPHRD_ETHER devices, this is guaranteed to
cover ETH_HLEN, and therefore being accessed on the skb after
the skb_store_bits().
Reported-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Willem de Bruijn <willemb@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Packet sockets can be used by various net devices and are not
really restricted to ARPHRD_ETHER device types. However, when
currently checking for the extra 4 bytes that can be transmitted
in VLAN case, our assumption is that we generally probe on
ARPHRD_ETHER devices. Therefore, before looking into Ethernet
header, check the device type first.
This also fixes the issue where non-ARPHRD_ETHER devices could
have no dev->hard_header_len in TX_RING SOCK_RAW case, and thus
the check would test unfilled linear part of the skb (instead
of non-linear).
Fixes: 57f89bfa21 ("network: Allow af_packet to transmit +4 bytes for VLAN packets.")
Fixes: 52f1454f62 ("packet: allow to transmit +4 byte in TX_RING slot for VLAN case")
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Willem de Bruijn <willemb@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
We concluded that the skb_probe_transport_header() should better be
called unconditionally. Avoiding the call into the flow dissector has
also not really much to do with the direct xmit mode.
While it seems that only virtio_net code makes use of GSO from non
RX/TX ring packet socket paths, we should probe for a transport header
nevertheless before they hit devices.
Reference: http://thread.gmane.org/gmane.linux.network/386173/
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Jason Wang <jasowang@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
In tpacket_fill_skb() commit c1aad275b0 ("packet: set transport
header before doing xmit") and later on 40893fd0fd ("net: switch
to use skb_probe_transport_header()") was probing for a transport
header on the skb from a ring buffer slot, but at a time, where
the skb has _not even_ been filled with data yet. So that call into
the flow dissector is pretty useless. Lets do it after we've set
up the skb frags.
Fixes: c1aad275b0 ("packet: set transport header before doing xmit")
Reported-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Jason Wang <jasowang@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
There is a race conditions between packet_notifier and packet_bind{_spkt}.
It happens if packet_notifier(NETDEV_UNREGISTER) executes between the
time packet_bind{_spkt} takes a reference on the new netdevice and the
time packet_do_bind sets po->ifindex.
In this case the notification can be missed.
If this happens during a dev_change_net_namespace this can result in the
netdevice to be moved to the new namespace while the packet_sock in the
old namespace still holds a reference on it. When the netdevice is later
deleted in the new namespace the deletion hangs since the packet_sock
is not found in the new namespace' &net->packet.sklist.
It can be reproduced with the script below.
This patch makes packet_do_bind check again for the presence of the
netdevice in the packet_sock's namespace after the synchronize_net
in unregister_prot_hook.
More in general it also uses the rcu lock for the duration of the bind
to stop dev_change_net_namespace/rollback_registered_many from
going past the synchronize_net following unlist_netdevice, so that
no NETDEV_UNREGISTER notifications can happen on the new netdevice
while the bind is executing. In order to do this some code from
packet_bind{_spkt} is consolidated into packet_do_dev.
import socket, os, time, sys
proto=7
realDev='em1'
vlanId=400
if len(sys.argv) > 1:
vlanId=int(sys.argv[1])
dev='vlan%d' % vlanId
os.system('taskset -p 0x10 %d' % os.getpid())
s = socket.socket(socket.PF_PACKET, socket.SOCK_RAW, proto)
os.system('ip link add link %s name %s type vlan id %d' %
(realDev, dev, vlanId))
os.system('ip netns add dummy')
pid=os.fork()
if pid == 0:
# dev should be moved while packet_do_bind is in synchronize net
os.system('taskset -p 0x20000 %d' % os.getpid())
os.system('ip link set %s netns dummy' % dev)
os.system('ip netns exec dummy ip link del %s' % dev)
s.close()
sys.exit(0)
time.sleep(.004)
try:
s.bind(('%s' % dev, proto+1))
except:
print 'Could not bind socket'
s.close()
os.system('ip netns del dummy')
sys.exit(0)
os.waitpid(pid, 0)
s.close()
os.system('ip netns del dummy')
sys.exit(0)
Signed-off-by: Francesco Ruggeri <fruggeri@arista.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
The function ip_defrag is called on both the input and the output
paths of the networking stack. In particular conntrack when it is
tracking outbound packets from the local machine calls ip_defrag.
So add a struct net parameter and stop making ip_defrag guess which
network namespace it needs to defragment packets in.
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
Acked-by: Pablo Neira Ayuso <pablo@netfilter.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
Recent TCP listener patches exposed a prior af_packet bug :
match_fanout_group() blindly assumes it is always safe
to cast sk to a packet socket to compare fanout with af_packet_priv
But SYNACK packets can be sent while attached to request_sock, which
are smaller than a "struct sock".
We can read non existent memory and crash.
Fixes: c0de08d042 ("af_packet: don't emit packet on orig fanout group")
Fixes: ca6fb06518 ("tcp: attach SYNACK messages to request sockets instead of listener")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Willem de Bruijn <willemb@google.com>
Cc: Eric Leblond <eric@regit.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Edward Hyunkoo Jee <edjee@google.com>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Willem de Bruijn <willemb@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
eBPF socket filter programs may see junk in 'u32 cb[5]' area,
since it could have been used by protocol layers earlier.
For socket filter programs used in af_packet we need to clean
20 bytes of skb->cb area if it could be used by the program.
For programs attached to TCP/UDP sockets we need to save/restore
these 20 bytes, since it's used by protocol layers.
Remove SK_RUN_FILTER macro, since it's no longer used.
Long term we may move this bpf cb area to per-cpu scratch, but that
requires addition of new 'per-cpu load/store' instructions,
so not suitable as a short term fix.
Fixes: d691f9e8d4 ("bpf: allow programs to write to certain skb fields")
Reported-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: Alexei Starovoitov <ast@plumgrid.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
The current ongoing effort to dump existing cBPF seccomp filters back
to user space requires to hold the pre-transformed instructions like
we do in case of socket filters from sk_attach_filter() side, so they
can be reloaded in original form at a later point in time by utilities
such as criu.
To prepare for this, simply extend the bpf_prog_create_from_user()
API to hold a flag that tells whether we should store the original
or not. Also, fanout filters could make use of that in future for
things like diag. While fanout filters already use bpf_prog_destroy(),
move seccomp over to them as well to handle original programs when
present.
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Cc: Tycho Andersen <tycho.andersen@canonical.com>
Cc: Pavel Emelyanov <xemul@parallels.com>
Cc: Kees Cook <keescook@chromium.org>
Cc: Andy Lutomirski <luto@amacapital.net>
Cc: Alexei Starovoitov <ast@plumgrid.com>
Tested-by: Tycho Andersen <tycho.andersen@canonical.com>
Acked-by: Alexei Starovoitov <ast@plumgrid.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Commit 7d82410950 ("virtio: add explicit big-endian support to memory
accessors") accidentally changed the virtio_net header used by
AF_PACKET with PACKET_VNET_HDR from host-endian to big-endian.
Since virtio_legacy_is_little_endian() is a very long identifier,
define a vio_le macro and use that throughout the code instead of the
hard-coded 'false' for little-endian.
This restores the ABI to match 4.1 and earlier kernels, and makes my
test program work again.
Signed-off-by: David Woodhouse <David.Woodhouse@intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Add fanout mode PACKET_FANOUT_EBPF that accepts an en extended BPF
program to select a socket.
Update the internal eBPF program by passing to socket option
SOL_PACKET/PACKET_FANOUT_DATA a file descriptor returned by bpf().
Signed-off-by: Willem de Bruijn <willemb@google.com>
Acked-by: Alexei Starovoitov <ast@plumgrid.com>
Acked-by: Daniel Borkmann <daniel@iogearbox.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
Add fanout mode PACKET_FANOUT_CBPF that accepts a classic BPF program
to select a socket.
This avoids having to keep adding special case fanout modes. One
example use case is application layer load balancing. The QUIC
protocol, for instance, encodes a connection ID in UDP payload.
Also add socket option SOL_PACKET/PACKET_FANOUT_DATA that updates data
associated with the socket group. Fanout mode PACKET_FANOUT_CBPF is the
only user so far.
Signed-off-by: Willem de Bruijn <willemb@google.com>
Acked-by: Alexei Starovoitov <ast@plumgrid.com>
Acked-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Conflicts:
arch/s390/net/bpf_jit_comp.c
drivers/net/ethernet/ti/netcp_ethss.c
net/bridge/br_multicast.c
net/ipv4/ip_fragment.c
All four conflicts were cases of simple overlapping
changes.
Signed-off-by: David S. Miller <davem@davemloft.net>
Follow e8e85cc5eb ("packet: remove handling of tx_ring") and remove
the tx_ring parameter from prb_shutdown_retire_blk_timer() as it is only
called with tx_ring = 0.
Signed-off-by: Tobias Klauser <tklauser@distanz.ch>
Signed-off-by: David S. Miller <davem@davemloft.net>
tpacket_fill_skb() can return a negative value (-errno) which
is stored in tp_len variable. In that case the following
condition will be (but shouldn't be) true:
tp_len > dev->mtu + dev->hard_header_len
as dev->mtu and dev->hard_header_len are both unsigned.
That may lead to just returning an incorrect EMSGSIZE errno
to the user.
Fixes: 52f1454f62 ("packet: allow to transmit +4 byte in TX_RING slot for VLAN case")
Signed-off-by: Alexander Drozdov <al.drozdov@gmail.com>
Acked-by: Daniel Borkmann <daniel@iogearbox.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
When binding a PF_PACKET socket, the use count of the bound interface is
always increased with dev_hold in dev_get_by_{index,name}. However,
when rebound with the same protocol and device as in the previous bind
the use count of the interface was not decreased. Ultimately, this
caused the deletion of the interface to fail with the following message:
unregister_netdevice: waiting for dummy0 to become free. Usage count = 1
This patch moves the dev_put out of the conditional part that was only
executed when either the protocol or device changed on a bind.
Fixes: 902fefb82e ('packet: improve socket create/bind latency in some cases')
Signed-off-by: Lars Westerhoff <lars.westerhoff@newtec.eu>
Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com>
Reviewed-by: Daniel Borkmann <dborkman@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>