[ Upstream commit 4cfc86f3dae6ca38ed49cdd78f458a03d4d87992 ]
Field fl4.flowi4_flags is not initialized in fib_compute_spec_dst()
before calling fib_lookup(), which means fib_table_lookup() is
using non-deterministic data at this line:
if (!(flp->flowi4_flags & FLOWI_FLAG_SKIP_NH_OIF)) {
Fix by initializing the entire fl4 structure, which will prevent
similar issues as fields are added in the future by ensuring that
all fields are initialized to zero unless explicitly initialized
to another value.
Fixes: 58189ca7b2 ("net: Fix vti use case with oif in dst lookups")
Suggested-by: David Ahern <dsa@cumulusnetworks.com>
Signed-off-by: Lance Richardson <lrichard@redhat.com>
Acked-by: David Ahern <dsa@cumulusnetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
[ Upstream commit ad0ea1989cc4d5905941d0a9e62c63ad6d859cef ]
Currently, ingress ipv4 broadcast datagrams are dropped since,
in udp_v4_early_demux(), ip_check_mc_rcu() is invoked even on
bcast packets.
This patch addresses the issue, invoking ip_check_mc_rcu()
only for mcast packets.
Fixes: 6e54030932 ("ipv4/udp: Verify multicast group is ours in upd_v4_early_demux()")
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Acked-by: Hannes Frederic Sowa <hannes@stressinduktion.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
[ Upstream commit e316ea62e3203d524ff0239a40c56d3a39ad1b5c ]
Now SYN_RECV request sockets are installed in ehash table, an ICMP
handler can find a request socket while another cpu handles an incoming
packet transforming this SYN_RECV request socket into an ESTABLISHED
socket.
We need to remove the now obsolete WARN_ON(req->sk), since req->sk
is set when a new child is created and added into listener accept queue.
If this race happens, the ICMP will do nothing special.
Fixes: 079096f103 ("tcp/dccp: install syn_recv requests into ehash table")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Reported-by: Ben Lazarus <blazarus@google.com>
Reported-by: Neal Cardwell <ncardwell@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
[ Upstream commit fbd40ea0180a2d328c5adc61414dc8bab9335ce2 ]
When an inetdev is destroyed, every address assigned to the interface
is removed. And in this scenerio we do two pointless things which can
be very expensive if the number of assigned interfaces is large:
1) Address promotion. We are deleting all addresses, so there is no
point in doing this.
2) A full nf conntrack table purge for every address. We only need to
do this once, as is already caught by the existing
masq_dev_notifier so masq_inet_event() can skip this.
Reported-by: Solar Designer <solar@openwall.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Tested-by: Cyrill Gorcunov <gorcunov@openvz.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
[ Upstream commit a9d99ce28ed359d68cf6f3c1a69038aefedf6d6a ]
If final packet (ACK) of 3WHS is lost, it appears we do not properly
account the following incoming segment into tcpi_segs_in
While we are at it, starts segs_in with one, to count the SYN packet.
We do not yet count number of SYN we received for a request sock, we
might add this someday.
packetdrill script showing proper behavior after fix :
// Tests tcpi_segs_in when 3rd packet (ACK) of 3WHS is lost
0.000 socket(..., SOCK_STREAM, IPPROTO_TCP) = 3
+0 setsockopt(3, SOL_SOCKET, SO_REUSEADDR, [1], 4) = 0
+0 bind(3, ..., ...) = 0
+0 listen(3, 1) = 0
+0 < S 0:0(0) win 32792 <mss 1000,sackOK,nop,nop>
+0 > S. 0:0(0) ack 1 <mss 1460,nop,nop,sackOK>
+.020 < P. 1:1001(1000) ack 1 win 32792
+0 accept(3, ..., ...) = 4
+.000 %{ assert tcpi_segs_in == 2, 'tcpi_segs_in=%d' % tcpi_segs_in }%
Fixes: 2efd055c53 ("tcp: add tcpi_segs_in and tcpi_segs_out to tcp_info")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
[ Upstream commit 1837b2e2bcd23137766555a63867e649c0b637f0 ]
The current reserved_tailroom calculation fails to take hlen and tlen into
account.
skb:
[__hlen__|__data____________|__tlen___|__extra__]
^ ^
head skb_end_offset
In this representation, hlen + data + tlen is the size passed to alloc_skb.
"extra" is the extra space made available in __alloc_skb because of
rounding up by kmalloc. We can reorder the representation like so:
[__hlen__|__data____________|__extra__|__tlen___]
^ ^
head skb_end_offset
The maximum space available for ip headers and payload without
fragmentation is min(mtu, data + extra). Therefore,
reserved_tailroom
= data + extra + tlen - min(mtu, data + extra)
= skb_end_offset - hlen - min(mtu, skb_end_offset - hlen - tlen)
= skb_tailroom - min(mtu, skb_tailroom - tlen) ; after skb_reserve(hlen)
Compare the second line to the current expression:
reserved_tailroom = skb_end_offset - min(mtu, skb_end_offset)
and we can see that hlen and tlen are not taken into account.
The min() in the third line can be expanded into:
if mtu < skb_tailroom - tlen:
reserved_tailroom = skb_tailroom - mtu
else:
reserved_tailroom = tlen
Depending on hlen, tlen, mtu and the number of multicast address records,
the current code may output skbs that have less tailroom than
dev->needed_tailroom or it may output more skbs than needed because not all
space available is used.
Fixes: 4c672e4b ("ipv6: mld: fix add_grhead skb_over_panic for devs with large MTUs")
Signed-off-by: Benjamin Poirier <bpoirier@suse.com>
Acked-by: Hannes Frederic Sowa <hannes@stressinduktion.org>
Acked-by: Daniel Borkmann <daniel@iogearbox.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
[ Upstream commit a8c4a2522a0808c5c2143612909717d1115c40cf ]
Otherwise we break the contract with GSO to only pass CHECKSUM_PARTIAL
skbs down. This can easily happen with UDP+IPv4 sockets with the first
MSG_MORE write smaller than the MTU, second write is a sendfile.
Returning -EOPNOTSUPP lets the callers fall back into normal sendmsg path,
were we calculate the checksum manually during copying.
Commit d749c9cbff ("ipv4: no CHECKSUM_PARTIAL on MSG_MORE corked
sockets") started to exposes this bug.
Fixes: d749c9cbff ("ipv4: no CHECKSUM_PARTIAL on MSG_MORE corked sockets")
Reported-by: Jiri Benc <jbenc@redhat.com>
Cc: Jiri Benc <jbenc@redhat.com>
Reported-by: Wakko Warner <wakko@animx.eu.org>
Cc: Wakko Warner <wakko@animx.eu.org>
Signed-off-by: Hannes Frederic Sowa <hannes@stressinduktion.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
[ Upstream commit 5146d1f151122e868e594c7b45115d64825aee5f ]
IPCB may contain data from previous layers (in the observed case the
qdisc layer). In the observed scenario, the data was misinterpreted as
ip header options, which later caused the ihl to be set to an invalid
value (<5). This resulted in an infinite loop in the mips implementation
of ip_fast_csum.
This patch clears IPCB(skb)->opt before dst_link_failure can be called for
various types of tunnels. This change only applies to encapsulated ipv4
packets.
The code introduced in 11c21a30 which clears all of IPCB has been removed
to be consistent with these changes, and instead the opt field is cleared
unconditionally in ip_tunnel_xmit. The change in ip_tunnel_xmit applies to
SIT, GRE, and IPIP tunnels.
The relevant vti, l2tp, and pptp functions already contain similar code for
clearing the IPCB.
Signed-off-by: Bernie Harris <bernie.harris@alliedtelesis.co.nz>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
[ Upstream commit 9bdfb3b79e61c60e1a3e2dc05ad164528afa6b8a ]
Currently it's converted into msecs, thus HZ=1000 intact.
Signed-off-by: Konstantin Khlebnikov <khlebnikov@yandex-team.ru>
Fixes: 740b0f1841 ("tcp: switch rtt estimations to usec resolution")
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
[ Upstream commit a97eb33ff225f34a8124774b3373fd244f0e83ce ]
An error response from a RTM_GETNETCONF request can return the positive
error value EINVAL in the struct nlmsgerr that can mislead userspace.
Signed-off-by: Anton Protopopov <a.s.protopopov@gmail.com>
Acked-by: Cong Wang <xiyou.wangcong@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
[ Upstream commit 7716682cc58e305e22207d5bb315f26af6b1e243 ]
Ilya reported following lockdep splat:
kernel: =========================
kernel: [ BUG: held lock freed! ]
kernel: 4.5.0-rc1-ceph-00026-g5e0a311 #1 Not tainted
kernel: -------------------------
kernel: swapper/5/0 is freeing memory
ffff880035c9d200-ffff880035c9dbff, with a lock still held there!
kernel: (&(&queue->rskq_lock)->rlock){+.-...}, at:
[<ffffffff816f6a88>] inet_csk_reqsk_queue_add+0x28/0xa0
kernel: 4 locks held by swapper/5/0:
kernel: #0: (rcu_read_lock){......}, at: [<ffffffff8169ef6b>]
netif_receive_skb_internal+0x4b/0x1f0
kernel: #1: (rcu_read_lock){......}, at: [<ffffffff816e977f>]
ip_local_deliver_finish+0x3f/0x380
kernel: #2: (slock-AF_INET){+.-...}, at: [<ffffffff81685ffb>]
sk_clone_lock+0x19b/0x440
kernel: #3: (&(&queue->rskq_lock)->rlock){+.-...}, at:
[<ffffffff816f6a88>] inet_csk_reqsk_queue_add+0x28/0xa0
To properly fix this issue, inet_csk_reqsk_queue_add() needs
to return to its callers if the child as been queued
into accept queue.
We also need to make sure listener is still there before
calling sk->sk_data_ready(), by holding a reference on it,
since the reference carried by the child can disappear as
soon as the child is put on accept queue.
Reported-by: Ilya Dryomov <idryomov@gmail.com>
Fixes: ebb516af60 ("tcp/dccp: fix race at listener dismantle phase")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
[ Upstream commit deed49df7390d5239024199e249190328f1651e7 ]
Since the gc of ipv4 route was removed, the route cached would has
no chance to be removed, and even it has been timeout, it still could
be used, cause no code to check it's expires.
Fix this issue by checking and removing route cache when we get route.
Signed-off-by: Xin Long <lucien.xin@gmail.com>
Acked-by: Hannes Frederic Sowa <hannes@stressinduktion.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
[ Upstream commit 729235554d805c63e5e274fcc6a98e71015dd847 ]
If tcp_v4_inbound_md5_hash() returns an error, we must release
the refcount on the request socket, not on the listener.
The bug was added for IPv4 only.
Fixes: 079096f103 ("tcp/dccp: install syn_recv requests into ehash table")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
[ Upstream commit 919483096bfe75dda338e98d56da91a263746a0a ]
Dmitry reported memory leaks of IP options allocated in
ip_cmsg_send() when/if this function returns an error.
Callers are responsible for the freeing.
Many thanks to Dmitry for the report and diagnostic.
Reported-by: Dmitry Vyukov <dvyukov@google.com>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
[ Upstream commit 5f74f82ea34c0da80ea0b49192bb5ea06e063593 ]
Devices may have limits on the number of fragments in an skb they support.
Current codebase uses a constant as maximum for number of fragments one
skb can hold and use.
When enabling scatter/gather and running traffic with many small messages
the codebase uses the maximum number of fragments and may thereby violate
the max for certain devices.
The patch introduces a global variable as max number of fragments.
Signed-off-by: Hans Westgaard Ry <hans.westgaard.ry@oracle.com>
Reviewed-by: Håkon Bugge <haakon.bugge@oracle.com>
Acked-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
[ Upstream commit 9cf7490360bf2c46a16b7525f899e4970c5fc144 ]
Petr Novopashenniy reported that ICMP redirects on SYN_RECV sockets
were leading to RST.
This is of course incorrect.
A specific list of ICMP messages should be able to drop a SYN_RECV.
For instance, a REDIRECT on SYN_RECV shall be ignored, as we do
not hold a dst per SYN_RECV pseudo request.
Bugzilla: https://bugzilla.kernel.org/show_bug.cgi?id=111751
Fixes: 079096f103 ("tcp/dccp: install syn_recv requests into ehash table")
Reported-by: Petr Novopashenniy <pety@rusnet.ru>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
[ Upstream commit ff5d749772018602c47509bdc0093ff72acd82ec ]
With some combinations of user provided flags in netlink command,
it is possible to call tcp_get_info() with a buffer that is not 8-bytes
aligned.
It does matter on some arches, so we need to use put_unaligned() to
store the u64 fields.
Current iproute2 package does not trigger this particular issue.
Fixes: 0df48c26d8 ("tcp: add tcpi_bytes_acked to tcp_info")
Fixes: 977cb0ecf8 ("tcp: add pacing_rate information into tcp_info")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
[ Upstream commit e62a123b8ef7c5dc4db2c16383d506860ad21b47 ]
Neal reported crashes with this stack trace :
RIP: 0010:[<ffffffff8c57231b>] tcp_v4_send_ack+0x41/0x20f
...
CR2: 0000000000000018 CR3: 000000044005c000 CR4: 00000000001427e0
...
[<ffffffff8c57258e>] tcp_v4_reqsk_send_ack+0xa5/0xb4
[<ffffffff8c1a7caa>] tcp_check_req+0x2ea/0x3e0
[<ffffffff8c19e420>] tcp_rcv_state_process+0x850/0x2500
[<ffffffff8c1a6d21>] tcp_v4_do_rcv+0x141/0x330
[<ffffffff8c56cdb2>] sk_backlog_rcv+0x21/0x30
[<ffffffff8c098bbd>] tcp_recvmsg+0x75d/0xf90
[<ffffffff8c0a8700>] inet_recvmsg+0x80/0xa0
[<ffffffff8c17623e>] sock_aio_read+0xee/0x110
[<ffffffff8c066fcf>] do_sync_read+0x6f/0xa0
[<ffffffff8c0673a1>] SyS_read+0x1e1/0x290
[<ffffffff8c5ca262>] system_call_fastpath+0x16/0x1b
The problem here is the skb we provide to tcp_v4_send_ack() had to
be parked in the backlog of a new TCP fastopen child because this child
was owned by the user at the time an out of window packet arrived.
Before queuing a packet, TCP has to set skb->dev to NULL as the device
could disappear before packet is removed from the queue.
Fix this issue by using the net pointer provided by the socket (being a
timewait or a request socket).
IPv6 is immune to the bug : tcp_v6_send_response() already gets the net
pointer from the socket if provided.
Fixes: 168a8f5805 ("tcp: TCP Fast Open Server - main code path")
Reported-by: Neal Cardwell <ncardwell@google.com>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Jerry Chu <hkchu@google.com>
Cc: Yuchung Cheng <ycheng@google.com>
Acked-by: Neal Cardwell <ncardwell@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
See the following build log splats. The sock_i_uid() helper doesn't
quite treat the parameter as 'const' (it acquires a member lock), but
this cast is the same approach taken by other callers in this file, so I
don't feel too bad about the fix.
CC net/ipv4/inet_connection_sock.o
CC net/ipv6/inet6_connection_sock.o
net/ipv6/inet6_connection_sock.c: In function ‘inet6_csk_route_req’:
net/ipv6/inet6_connection_sock.c:89:2: warning: passing argument 1 of ‘sock_i_uid’ discards ‘const’ qualifier from pointer target type [enabled by default]
In file included from include/linux/tcp.h:22:0,
from include/linux/ipv6.h:73,
from net/ipv6/inet6_connection_sock.c:18:
include/net/sock.h:1689:8: note: expected ‘struct sock *’ but argument is of type ‘const struct sock *’
net/ipv4/inet_connection_sock.c: In function ‘inet_csk_route_req’:
net/ipv4/inet_connection_sock.c:423:7: warning: passing argument 1 of ‘sock_i_uid’ discards ‘const’ qualifier from pointer target type [enabled by default]
In file included from include/net/inet_sock.h:27:0,
from include/net/inet_connection_sock.h:23,
from net/ipv4/inet_connection_sock.c:19:
include/net/sock.h:1689:8: note: expected ‘struct sock *’ but argument is of type ‘const struct sock *’
net/ipv4/inet_connection_sock.c: In function ‘inet_csk_route_child_sock’:
net/ipv4/inet_connection_sock.c:460:7: warning: passing argument 1 of ‘sock_i_uid’ discards ‘const’ qualifier from pointer target type [enabled by default]
In file included from include/net/inet_sock.h:27:0,
from include/net/inet_connection_sock.h:23,
from net/ipv4/inet_connection_sock.c:19:
include/net/sock.h:1689:8: note: expected ‘struct sock *’ but argument is of type ‘const struct sock *’
Change-Id: I5c156fc1a81f90323717bffd93c31d205b85620c
Signed-off-by: Brian Norris <briannorris@google.com>
Lorenzo reported that we could not properly find v4mapped sockets
in inet_diag_find_one_icsk(). This patch fixes the issue.
[cherry-pick of fc439d9489479411fbf9bbbec2c768df89e85503]
Change-Id: I13515e83fb76d4729f00047f9eb142c929390fb2
Reported-by: Lorenzo Colitti <lorenzo@google.com>
Signed-off-by: Eric Dumazet <edumazet@google.com>
When closing a listen socket, tcp_abort currently calls
tcp_done without clearing the request queue. If the socket has a
child socket that is established but not yet accepted, the child
socket is then left without a parent, causing a leak.
Fix this by setting the socket state to TCP_CLOSE and calling
inet_csk_listen_stop with the socket lock held, like tcp_close
does.
Tested using net_test. With this patch, calling SOCK_DESTROY on a
listen socket that has an established but not yet accepted child
socket results in the parent and the child being closed, such
that they no longer appear in sock_diag dumps.
[cherry-pick of net-next 2010b93e9317cc12acd20c4aed385af7f9d1681e]
Change-Id: I0555a142f11d8b36362ffd7c8ef4a5ecae8987c9
Reported-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: Lorenzo Colitti <lorenzo@google.com>
Acked-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Adding support for SYN_RECV request sockets to tcp_abort()
is quite easy after our tcp listener rewrite.
Note that we also need to better handle listeners, or we might
leak not yet accepted children, because of a missing
inet_csk_listen_stop() call.
[cherry-pick of net-next 07f6f4a31e5a8dee67960fc07bb0b37c5f879d4d]
Change-Id: I8ec6b2e6ec24f330a69595abf1d5469ace79b3fd
Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Lorenzo Colitti <lorenzo@google.com>
Tested-by: Lorenzo Colitti <lorenzo@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
This implements SOCK_DESTROY for TCP sockets. It causes all
blocking calls on the socket to fail fast with ECONNABORTED and
causes a protocol close of the socket. It informs the other end
of the connection by sending a RST, i.e., initiating a TCP ABORT
as per RFC 793. ECONNABORTED was chosen for consistency with
FreeBSD.
[cherry-pick of net-next c1e64e298b8cad309091b95d8436a0255c84f54a]
Change-Id: I728a01ef03f2ccfb9016a3f3051ef00975980e49
Signed-off-by: Lorenzo Colitti <lorenzo@google.com>
Acked-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
This passes the SOCK_DESTROY operation to the underlying protocol
diag handler, or returns -EOPNOTSUPP if that handler does not
define a destroy operation.
Most of this patch is just renaming functions. This is not
strictly necessary, but it would be fairly counterintuitive to
have the code to destroy inet sockets be in a function whose name
starts with inet_diag_get.
[backport of net-next 6eb5d2e08f071c05ecbe135369c9ad418826cab2]
Change-Id: Idc13a7def20f492a5323ad2f8de105426293bd37
Signed-off-by: Lorenzo Colitti <lorenzo@google.com>
Acked-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Currently, inet_diag_dump_one_icsk finds a socket and then dumps
its information to userspace. Split it into a part that finds the
socket and a part that dumps the information.
[cherry-pick of net-next b613f56ec9baf30edf5d9d607b822532a273dad7]
Change-Id: I144765afb6ff1cd66eb4757c9418112fb0b08a6f
Signed-off-by: Lorenzo Colitti <lorenzo@google.com>
Acked-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
If CONFIG_IPV6=m is selected, we are getting following build errors.
net/built-in.o: In function `tcp_is_local6':
net/ipv4/tcp.c:3261: undefined reference to `rt6_lookup'
Making the code conditional upon only CONFIG_IPV6=y fixes this issue.
Also export tcp_nuke_addr to build IPv6 modules. Otherwise
we run into following build error:
CC [M] lib/zlib_deflate/deftree.o
CC [M] lib/zlib_deflate/deflate_syms.o
LD [M] lib/zlib_deflate/zlib_deflate.o
Building modules, stage 2.
MODPOST 46 modules
ERROR: "tcp_nuke_addr" [net/ipv6/ipv6.ko] undefined!
make[2]: *** [__modpost] Error 1
Signed-off-by: Tushar Behera <tushar.behera@linaro.org>
CC: John Stultz <john.stultz@linaro.org>
Signed-off-by: John Stultz <john.stultz@linaro.org>
Signed-off-by: Amit Pundir <amit.pundir@linaro.org>
c7c3ec4903d32c60423ee013d96e94602f66042c cherry-picked the
tcp_nuke_addr ioctl, but omitted a check that ensures that a
socket is an IPv6 socket. This makes it so that if we issue a
SIOCKILLADDR on ::, it kills IPv4 sockets as well.
This is because every IPv4 socket has an IPv6 source address
(sk_v6_rcv_saddr) of ::. Thus, when we iterate over an IPv4
socket, and compare the source address of the socket to the
source address in the ioctl, it matches the :: that was passed
in, and we kill the socket.
Change-Id: I736431a898e6ec91536536d352936a210aa10100
The default initial rwnd is hardcoded to 10.
Now we allow it to be controlled via
/proc/sys/net/ipv4/tcp_default_init_rwnd
which limits the values from 3 to 100
This is somewhat needed because ipv6 routes are
autoconfigured by the kernel.
See "An Argument for Increasing TCP's Initial Congestion Window"
in https://developers.google.com/speed/articles/tcp_initcwnd_paper.pdf
Change-Id: I386b2a9d62de0ebe05c1ebe1b4bd91b314af5c54
Signed-off-by: JP Abgrall <jpa@google.com>
Conflicts:
net/ipv4/sysctl_net_ipv4.c
net/ipv4/tcp_input.c
This contains the following commits:
1. cc2f522 net: core: Add a UID range to fib rules.
2. d7ed2bd net: core: Use the socket UID in routing lookups.
3. 2f9306a net: core: Add a RTA_UID attribute to routes.
This is so that userspace can do per-UID route lookups.
4. 8e46efb net: ipv6: Use the UID in IPv6 PMTUD
IPv4 PMTUD already does this because ipv4_sk_update_pmtu
uses __build_flow_key, which includes the UID.
Bug: 15413527
Change-Id: Iae3d4ca3979d252b6cec989bdc1a6875f811f03a
Signed-off-by: Lorenzo Colitti <lorenzo@google.com>
Small build fixes for xt_quota2 and ipv4 changes
Change-Id: Ib098768040c8875887b2081c3165a6c83b37e180
Signed-off-by: John Stultz <john.stultz@linaro.org>
The actual size of the tcp hashinfo table is tcp_hashinfo.ehash_mask + 1
so we need to adjust the loop accordingly to get the sockets hashed into
the last bucket.
Change-Id: I796b3c7b4a1a7fa35fba9e5192a4a403eb6e17de
Signed-off-by: Dmitry Torokhov <dtor@google.com>
Introduce a new socket ioctl, SIOCKILLADDR, that nukes all sockets
bound to the same local address. This is useful in situations with
dynamic IPs, to kill stuck connections.
Signed-off-by: Brian Swetland <swetland@google.com>
net: fix tcp_v4_nuke_addr
Signed-off-by: Dima Zavin <dima@android.com>
net: ipv4: Fix a spinlock recursion bug in tcp_v4_nuke.
We can't hold the lock while calling to tcp_done(), so we drop
it before calling. We then have to start at the top of the chain again.
Signed-off-by: Dima Zavin <dima@android.com>
net: ipv4: Fix race in tcp_v4_nuke_addr().
To fix a recursive deadlock in 2.6.29, we stopped holding the hash table lock
across tcp_done() calls. This fixed the deadlock, but introduced a race where
the socket could die or change state.
Fix: Before unlocking the hash table, we grab a reference to the socket. We
can then unlock the hash table without risk of the socket going away. We then
lock the socket, which is safe because it is pinned. We can then call
tcp_done() without recursive deadlock and without race. Upon return, we unlock
the socket and then unpin it, killing it.
Change-Id: Idcdae072b48238b01bdbc8823b60310f1976e045
Signed-off-by: Robert Love <rlove@google.com>
Acked-by: Dima Zavin <dima@android.com>
ipv4: disable bottom halves around call to tcp_done().
Signed-off-by: Robert Love <rlove@google.com>
Signed-off-by: Colin Cross <ccross@android.com>
ipv4: Move sk_error_report inside bh_lock_sock in tcp_v4_nuke_addr
When sk_error_report is called, it wakes up the user-space thread, which then
calls tcp_close. When the tcp_close is interrupted by the tcp_v4_nuke_addr
ioctl thread running tcp_done, it leaks 392 bytes and triggers a WARN_ON.
This patch moves the call to sk_error_report inside the bh_lock_sock, which
matches the locking used in tcp_v4_err.
Signed-off-by: Colin Cross <ccross@android.com>
Add a family of knobs to /sys/kernel/ipv4 for controlling the TCP window size:
tcp_wmem_min
tcp_wmem_def
tcp_wmem_max
tcp_rmem_min
tcp_rmem_def
tcp_rmem_max
This six values mirror the sysctl knobs in /proc/sys/net/ipv4/tcp_wmem and
/proc/sys/net/ipv4/tcp_rmem.
Sysfs, unlike sysctl, allows us to set and manage the files' permissions and
owners.
Signed-off-by: Robert Love <rlove@google.com>
With CONFIG_ANDROID_PARANOID_NETWORK, require specific uids/gids to instantiate
network sockets.
Signed-off-by: Robert Love <rlove@google.com>
paranoid networking: Use in_egroup_p() to check group membership
The previous group_search() caused trouble for partners with module builds.
in_egroup_p() is also cleaner.
Signed-off-by: Nick Pelly <npelly@google.com>
Fix 2.6.29 build.
Signed-off-by: Arve Hjønnevåg <arve@android.com>
net: Fix compilation of the IPv6 module
Fix compilation of the IPv6 module -- current->euid does not exist anymore,
current_euid() is what needs to be used.
Signed-off-by: Steinar H. Gunderson <sesse@google.com>
net: bluetooth: Remove the AID_NET_BT* gid numbers
Removed bluetooth checks for AID_NET_BT and AID_NET_BT_ADMIN
which are not useful anymore.
This is in preparation for getting rid of all the AID_* gids.
Signed-off-by: JP Abgrall <jpa@google.com>
[ Upstream commit 9207f9d45b0ad071baa128e846d7e7ed85016df3 ]
Skb_gso_segment() uses skb control block during segmentation.
This patch adds 32-bytes room for previous control block which
will be copied into all resulting segments.
This patch fixes kernel crash during fragmenting forwarded packets.
Fragmentation requires valid IP CB in skb for clearing ip options.
Also patch removes custom save/restore in ovs code, now it's redundant.
Signed-off-by: Konstantin Khlebnikov <koct9i@gmail.com>
Link: http://lkml.kernel.org/r/CALYGNiP-0MZ-FExV2HutTvE9U-QQtkKSoE--KN=JQE5STYsjAA@mail.gmail.com
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
[ Upstream commit 40ba330227ad00b8c0cdf2f425736ff9549cc423 ]
Commit acf8dd0a9d ("udp: only allow UFO for packets from SOCK_DGRAM
sockets") disallows UFO for packets sent from raw sockets. We need to do
the same also for SOCK_DGRAM sockets with SO_NO_CHECK options, even if
for a bit different reason: while such socket would override the
CHECKSUM_PARTIAL set by ip_ufo_append_data(), gso_size is still set and
bad offloading flags warning is triggered in __skb_gso_segment().
In the IPv6 case, SO_NO_CHECK option is ignored but we need to disallow
UFO for packets sent by sockets with UDP_NO_CHECK6_TX option.
Signed-off-by: Michal Kubecek <mkubecek@suse.cz>
Tested-by: Shannon Nelson <shannon.nelson@intel.com>
Acked-by: Hannes Frederic Sowa <hannes@stressinduktion.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
[ Upstream commit 83d15e70c4d8909d722c0d64747d8fb42e38a48f ]
For tcp_yeah, use an ssthresh floor of 2, the same floor used by Reno
and CUBIC, per RFC 5681 (equation 4).
tcp_yeah_ssthresh() was sometimes returning a 0 or negative ssthresh
value if the intended reduction is as big or bigger than the current
cwnd. Congestion control modules should never return a zero or
negative ssthresh. A zero ssthresh generally results in a zero cwnd,
causing the connection to stall. A negative ssthresh value will be
interpreted as a u32 and will set a target cwnd for PRR near 4
billion.
Oleksandr Natalenko reported that a system using tcp_yeah with ECN
could see a warning about a prior_cwnd of 0 in
tcp_cwnd_reduction(). Testing verified that this was due to
tcp_yeah_ssthresh() misbehaving in this way.
Reported-by: Oleksandr Natalenko <oleksandr@natalenko.name>
Signed-off-by: Neal Cardwell <ncardwell@google.com>
Signed-off-by: Yuchung Cheng <ycheng@google.com>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Patch 3759824da8 ("tcp: PRR uses CRB mode by default and SS mode
conditionally") introduced a bug that cwnd may become 0 when both
inflight and sndcnt are 0 (cwnd = inflight + sndcnt). This may lead
to a div-by-zero if the connection starts another cwnd reduction
phase by setting tp->prior_cwnd to the current cwnd (0) in
tcp_init_cwnd_reduction().
To prevent this we skip PRR operation when nothing is acked or
sacked. Then cwnd must be positive in all cases as long as ssthresh
is positive:
1) The proportional reduction mode
inflight > ssthresh > 0
2) The reduction bound mode
a) inflight == ssthresh > 0
b) inflight < ssthresh
sndcnt > 0 since newly_acked_sacked > 0 and inflight < ssthresh
Therefore in all cases inflight and sndcnt can not both be 0.
We check invalid tp->prior_cwnd to avoid potential div0 bugs.
In reality this bug is triggered only with a sequence of less common
events. For example, the connection is terminating an ECN-triggered
cwnd reduction with an inflight 0, then it receives reordered/old
ACKs or DSACKs from prior transmission (which acks nothing). Or the
connection is in fast recovery stage that marks everything lost,
but fails to retransmit due to local issues, then receives data
packets from other end which acks nothing.
Fixes: 3759824da8 ("tcp: PRR uses CRB mode by default and SS mode conditionally")
Reported-by: Oleksandr Natalenko <oleksandr@natalenko.name>
Signed-off-by: Yuchung Cheng <ycheng@google.com>
Signed-off-by: Neal Cardwell <ncardwell@google.com>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>