Commit graph

7311 commits

Author SHA1 Message Date
Hans Westgaard Ry
1bec5f4066 net:Add sysctl_max_skb_frags
[ Upstream commit 5f74f82ea34c0da80ea0b49192bb5ea06e063593 ]

Devices may have limits on the number of fragments in an skb they support.
Current codebase uses a constant as maximum for number of fragments one
skb can hold and use.
When enabling scatter/gather and running traffic with many small messages
the codebase uses the maximum number of fragments and may thereby violate
the max for certain devices.
The patch introduces a global variable as max number of fragments.

Signed-off-by: Hans Westgaard Ry <hans.westgaard.ry@oracle.com>
Reviewed-by: Håkon Bugge <haakon.bugge@oracle.com>
Acked-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2016-03-03 15:07:05 -08:00
Eric Dumazet
2679161c77 tcp: do not drop syn_recv on all icmp reports
[ Upstream commit 9cf7490360bf2c46a16b7525f899e4970c5fc144 ]

Petr Novopashenniy reported that ICMP redirects on SYN_RECV sockets
were leading to RST.

This is of course incorrect.

A specific list of ICMP messages should be able to drop a SYN_RECV.

For instance, a REDIRECT on SYN_RECV shall be ignored, as we do
not hold a dst per SYN_RECV pseudo request.

Bugzilla: https://bugzilla.kernel.org/show_bug.cgi?id=111751
Fixes: 079096f103 ("tcp/dccp: install syn_recv requests into ehash table")
Reported-by: Petr Novopashenniy <pety@rusnet.ru>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2016-03-03 15:07:05 -08:00
Eric Dumazet
87e40d8d8b tcp: beware of alignments in tcp_get_info()
[ Upstream commit ff5d749772018602c47509bdc0093ff72acd82ec ]

With some combinations of user provided flags in netlink command,
it is possible to call tcp_get_info() with a buffer that is not 8-bytes
aligned.

It does matter on some arches, so we need to use put_unaligned() to
store the u64 fields.

Current iproute2 package does not trigger this particular issue.

Fixes: 0df48c26d8 ("tcp: add tcpi_bytes_acked to tcp_info")
Fixes: 977cb0ecf8 ("tcp: add pacing_rate information into tcp_info")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2016-03-03 15:07:05 -08:00
Joe Stringer
649dc6c32e inet: frag: Always orphan skbs inside ip_defrag()
[ Upstream commit 8282f27449bf15548cb82c77b6e04ee0ab827bdc ]

Later parts of the stack (including fragmentation) expect that there is
never a socket attached to frag in a frag_list, however this invariant
was not enforced on all defrag paths. This could lead to the
BUG_ON(skb->sk) during ip_do_fragment(), as per the call stack at the
end of this commit message.

While the call could be added to openvswitch to fix this particular
error, the head and tail of the frags list are already orphaned
indirectly inside ip_defrag(), so it seems like the remaining fragments
should all be orphaned in all circumstances.

kernel BUG at net/ipv4/ip_output.c:586!
[...]
Call Trace:
 <IRQ>
 [<ffffffffa0205270>] ? do_output.isra.29+0x1b0/0x1b0 [openvswitch]
 [<ffffffffa02167a7>] ovs_fragment+0xcc/0x214 [openvswitch]
 [<ffffffff81667830>] ? dst_discard_out+0x20/0x20
 [<ffffffff81667810>] ? dst_ifdown+0x80/0x80
 [<ffffffffa0212072>] ? find_bucket.isra.2+0x62/0x70 [openvswitch]
 [<ffffffff810e0ba5>] ? mod_timer_pending+0x65/0x210
 [<ffffffff810b732b>] ? __lock_acquire+0x3db/0x1b90
 [<ffffffffa03205a2>] ? nf_conntrack_in+0x252/0x500 [nf_conntrack]
 [<ffffffff810b63c4>] ? __lock_is_held+0x54/0x70
 [<ffffffffa02051a3>] do_output.isra.29+0xe3/0x1b0 [openvswitch]
 [<ffffffffa0206411>] do_execute_actions+0xe11/0x11f0 [openvswitch]
 [<ffffffff810b63c4>] ? __lock_is_held+0x54/0x70
 [<ffffffffa0206822>] ovs_execute_actions+0x32/0xd0 [openvswitch]
 [<ffffffffa020b505>] ovs_dp_process_packet+0x85/0x140 [openvswitch]
 [<ffffffff810b63c4>] ? __lock_is_held+0x54/0x70
 [<ffffffffa02068a2>] ovs_execute_actions+0xb2/0xd0 [openvswitch]
 [<ffffffffa020b505>] ovs_dp_process_packet+0x85/0x140 [openvswitch]
 [<ffffffffa0215019>] ? ovs_ct_get_labels+0x49/0x80 [openvswitch]
 [<ffffffffa0213a1d>] ovs_vport_receive+0x5d/0xa0 [openvswitch]
 [<ffffffff810b732b>] ? __lock_acquire+0x3db/0x1b90
 [<ffffffff810b732b>] ? __lock_acquire+0x3db/0x1b90
 [<ffffffff810b732b>] ? __lock_acquire+0x3db/0x1b90
 [<ffffffffa0214895>] ? internal_dev_xmit+0x5/0x140 [openvswitch]
 [<ffffffffa02148fc>] internal_dev_xmit+0x6c/0x140 [openvswitch]
 [<ffffffffa0214895>] ? internal_dev_xmit+0x5/0x140 [openvswitch]
 [<ffffffff81660299>] dev_hard_start_xmit+0x2b9/0x5e0
 [<ffffffff8165fc21>] ? netif_skb_features+0xd1/0x1f0
 [<ffffffff81660f20>] __dev_queue_xmit+0x800/0x930
 [<ffffffff81660770>] ? __dev_queue_xmit+0x50/0x930
 [<ffffffff810b53f1>] ? mark_held_locks+0x71/0x90
 [<ffffffff81669876>] ? neigh_resolve_output+0x106/0x220
 [<ffffffff81661060>] dev_queue_xmit+0x10/0x20
 [<ffffffff816698e8>] neigh_resolve_output+0x178/0x220
 [<ffffffff816a8e6f>] ? ip_finish_output2+0x1ff/0x590
 [<ffffffff816a8e6f>] ip_finish_output2+0x1ff/0x590
 [<ffffffff816a8cee>] ? ip_finish_output2+0x7e/0x590
 [<ffffffff816a9a31>] ip_do_fragment+0x831/0x8a0
 [<ffffffff816a8c70>] ? ip_copy_metadata+0x1b0/0x1b0
 [<ffffffff816a9ae3>] ip_fragment.constprop.49+0x43/0x80
 [<ffffffff816a9c9c>] ip_finish_output+0x17c/0x340
 [<ffffffff8169a6f4>] ? nf_hook_slow+0xe4/0x190
 [<ffffffff816ab4c0>] ip_output+0x70/0x110
 [<ffffffff816a9b20>] ? ip_fragment.constprop.49+0x80/0x80
 [<ffffffff816aa9f9>] ip_local_out+0x39/0x70
 [<ffffffff816abf89>] ip_send_skb+0x19/0x40
 [<ffffffff816abfe3>] ip_push_pending_frames+0x33/0x40
 [<ffffffff816df21a>] icmp_push_reply+0xea/0x120
 [<ffffffff816df93d>] icmp_reply.constprop.23+0x1ed/0x230
 [<ffffffff816df9ce>] icmp_echo.part.21+0x4e/0x50
 [<ffffffff810b63c4>] ? __lock_is_held+0x54/0x70
 [<ffffffff810d5f9e>] ? rcu_read_lock_held+0x5e/0x70
 [<ffffffff816dfa06>] icmp_echo+0x36/0x70
 [<ffffffff816e0d11>] icmp_rcv+0x271/0x450
 [<ffffffff816a4ca7>] ip_local_deliver_finish+0x127/0x3a0
 [<ffffffff816a4bc1>] ? ip_local_deliver_finish+0x41/0x3a0
 [<ffffffff816a5160>] ip_local_deliver+0x60/0xd0
 [<ffffffff816a4b80>] ? ip_rcv_finish+0x560/0x560
 [<ffffffff816a46fd>] ip_rcv_finish+0xdd/0x560
 [<ffffffff816a5453>] ip_rcv+0x283/0x3e0
 [<ffffffff810b6302>] ? match_held_lock+0x192/0x200
 [<ffffffff816a4620>] ? inet_del_offload+0x40/0x40
 [<ffffffff8165d062>] __netif_receive_skb_core+0x392/0xae0
 [<ffffffff8165e68e>] ? process_backlog+0x8e/0x230
 [<ffffffff810b53f1>] ? mark_held_locks+0x71/0x90
 [<ffffffff8165d7c8>] __netif_receive_skb+0x18/0x60
 [<ffffffff8165e678>] process_backlog+0x78/0x230
 [<ffffffff8165e6dd>] ? process_backlog+0xdd/0x230
 [<ffffffff8165e355>] net_rx_action+0x155/0x400
 [<ffffffff8106b48c>] __do_softirq+0xcc/0x420
 [<ffffffff816a8e87>] ? ip_finish_output2+0x217/0x590
 [<ffffffff8178e78c>] do_softirq_own_stack+0x1c/0x30
 <EOI>
 [<ffffffff8106b88e>] do_softirq+0x4e/0x60
 [<ffffffff8106b948>] __local_bh_enable_ip+0xa8/0xb0
 [<ffffffff816a8eb0>] ip_finish_output2+0x240/0x590
 [<ffffffff816a9a31>] ? ip_do_fragment+0x831/0x8a0
 [<ffffffff816a9a31>] ip_do_fragment+0x831/0x8a0
 [<ffffffff816a8c70>] ? ip_copy_metadata+0x1b0/0x1b0
 [<ffffffff816a9ae3>] ip_fragment.constprop.49+0x43/0x80
 [<ffffffff816a9c9c>] ip_finish_output+0x17c/0x340
 [<ffffffff8169a6f4>] ? nf_hook_slow+0xe4/0x190
 [<ffffffff816ab4c0>] ip_output+0x70/0x110
 [<ffffffff816a9b20>] ? ip_fragment.constprop.49+0x80/0x80
 [<ffffffff816aa9f9>] ip_local_out+0x39/0x70
 [<ffffffff816abf89>] ip_send_skb+0x19/0x40
 [<ffffffff816abfe3>] ip_push_pending_frames+0x33/0x40
 [<ffffffff816d55d3>] raw_sendmsg+0x7d3/0xc30
 [<ffffffff810b732b>] ? __lock_acquire+0x3db/0x1b90
 [<ffffffff816e7557>] ? inet_sendmsg+0xc7/0x1d0
 [<ffffffff810b63c4>] ? __lock_is_held+0x54/0x70
 [<ffffffff816e759a>] inet_sendmsg+0x10a/0x1d0
 [<ffffffff816e7495>] ? inet_sendmsg+0x5/0x1d0
 [<ffffffff8163e398>] sock_sendmsg+0x38/0x50
 [<ffffffff8163ec5f>] ___sys_sendmsg+0x25f/0x270
 [<ffffffff811aadad>] ? handle_mm_fault+0x8dd/0x1320
 [<ffffffff8178c147>] ? _raw_spin_unlock+0x27/0x40
 [<ffffffff810529b2>] ? __do_page_fault+0x1e2/0x460
 [<ffffffff81204886>] ? __fget_light+0x66/0x90
 [<ffffffff8163f8e2>] __sys_sendmsg+0x42/0x80
 [<ffffffff8163f932>] SyS_sendmsg+0x12/0x20
 [<ffffffff8178cb17>] entry_SYSCALL_64_fastpath+0x12/0x6f
Code: 00 00 44 89 e0 e9 7c fb ff ff 4c 89 ff e8 e7 e7 ff ff 41 8b 9d 80 00 00 00 2b 5d d4 89 d8 c1 f8 03 0f b7 c0 e9 33 ff ff f
 66 66 66 2e 0f 1f 84 00 00 00 00 00 66 66 66 66 90 55 48
RIP  [<ffffffff816a9a92>] ip_do_fragment+0x892/0x8a0
 RSP <ffff88006d603170>

Fixes: 7f8a436eaa ("openvswitch: Add conntrack action")
Signed-off-by: Joe Stringer <joe@ovn.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2016-03-03 15:07:04 -08:00
Eric Dumazet
e5abc10d19 tcp: fix NULL deref in tcp_v4_send_ack()
[ Upstream commit e62a123b8ef7c5dc4db2c16383d506860ad21b47 ]

Neal reported crashes with this stack trace :

 RIP: 0010:[<ffffffff8c57231b>] tcp_v4_send_ack+0x41/0x20f
...
 CR2: 0000000000000018 CR3: 000000044005c000 CR4: 00000000001427e0
...
  [<ffffffff8c57258e>] tcp_v4_reqsk_send_ack+0xa5/0xb4
  [<ffffffff8c1a7caa>] tcp_check_req+0x2ea/0x3e0
  [<ffffffff8c19e420>] tcp_rcv_state_process+0x850/0x2500
  [<ffffffff8c1a6d21>] tcp_v4_do_rcv+0x141/0x330
  [<ffffffff8c56cdb2>] sk_backlog_rcv+0x21/0x30
  [<ffffffff8c098bbd>] tcp_recvmsg+0x75d/0xf90
  [<ffffffff8c0a8700>] inet_recvmsg+0x80/0xa0
  [<ffffffff8c17623e>] sock_aio_read+0xee/0x110
  [<ffffffff8c066fcf>] do_sync_read+0x6f/0xa0
  [<ffffffff8c0673a1>] SyS_read+0x1e1/0x290
  [<ffffffff8c5ca262>] system_call_fastpath+0x16/0x1b

The problem here is the skb we provide to tcp_v4_send_ack() had to
be parked in the backlog of a new TCP fastopen child because this child
was owned by the user at the time an out of window packet arrived.

Before queuing a packet, TCP has to set skb->dev to NULL as the device
could disappear before packet is removed from the queue.

Fix this issue by using the net pointer provided by the socket (being a
timewait or a request socket).

IPv6 is immune to the bug : tcp_v6_send_response() already gets the net
pointer from the socket if provided.

Fixes: 168a8f5805 ("tcp: TCP Fast Open Server - main code path")
Reported-by: Neal Cardwell <ncardwell@google.com>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Jerry Chu <hkchu@google.com>
Cc: Yuchung Cheng <ycheng@google.com>
Acked-by: Neal Cardwell <ncardwell@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2016-03-03 15:07:04 -08:00
Brian Norris
896d0a6219 ANDROID: net: fix 'const' warnings
See the following build log splats. The sock_i_uid() helper doesn't
quite treat the parameter as 'const' (it acquires a member lock), but
this cast is the same approach taken by other callers in this file, so I
don't feel too bad about the fix.

  CC      net/ipv4/inet_connection_sock.o
  CC      net/ipv6/inet6_connection_sock.o
net/ipv6/inet6_connection_sock.c: In function ‘inet6_csk_route_req’:
net/ipv6/inet6_connection_sock.c:89:2: warning: passing argument 1 of ‘sock_i_uid’ discards ‘const’ qualifier from pointer target type [enabled by default]
In file included from include/linux/tcp.h:22:0,
                 from include/linux/ipv6.h:73,
                 from net/ipv6/inet6_connection_sock.c:18:
include/net/sock.h:1689:8: note: expected ‘struct sock *’ but argument is of type ‘const struct sock *’
net/ipv4/inet_connection_sock.c: In function ‘inet_csk_route_req’:
net/ipv4/inet_connection_sock.c:423:7: warning: passing argument 1 of ‘sock_i_uid’ discards ‘const’ qualifier from pointer target type [enabled by default]
In file included from include/net/inet_sock.h:27:0,
                 from include/net/inet_connection_sock.h:23,
                 from net/ipv4/inet_connection_sock.c:19:
include/net/sock.h:1689:8: note: expected ‘struct sock *’ but argument is of type ‘const struct sock *’
net/ipv4/inet_connection_sock.c: In function ‘inet_csk_route_child_sock’:
net/ipv4/inet_connection_sock.c:460:7: warning: passing argument 1 of ‘sock_i_uid’ discards ‘const’ qualifier from pointer target type [enabled by default]
In file included from include/net/inet_sock.h:27:0,
                 from include/net/inet_connection_sock.h:23,
                 from net/ipv4/inet_connection_sock.c:19:
include/net/sock.h:1689:8: note: expected ‘struct sock *’ but argument is of type ‘const struct sock *’

Change-Id: I5c156fc1a81f90323717bffd93c31d205b85620c
Signed-off-by: Brian Norris <briannorris@google.com>
2016-03-01 11:59:28 -08:00
Eric Dumazet
0cf03042cf net: diag: support v4mapped sockets in inet_diag_find_one_icsk()
Lorenzo reported that we could not properly find v4mapped sockets
in inet_diag_find_one_icsk(). This patch fixes the issue.

[cherry-pick of fc439d9489479411fbf9bbbec2c768df89e85503]

Change-Id: I13515e83fb76d4729f00047f9eb142c929390fb2
Reported-by: Lorenzo Colitti <lorenzo@google.com>
Signed-off-by: Eric Dumazet <edumazet@google.com>
2016-02-25 09:01:21 +09:00
Lorenzo Colitti
15248ad94d net: tcp: deal with listen sockets properly in tcp_abort.
When closing a listen socket, tcp_abort currently calls
tcp_done without clearing the request queue. If the socket has a
child socket that is established but not yet accepted, the child
socket is then left without a parent, causing a leak.

Fix this by setting the socket state to TCP_CLOSE and calling
inet_csk_listen_stop with the socket lock held, like tcp_close
does.

Tested using net_test. With this patch, calling SOCK_DESTROY on a
listen socket that has an established but not yet accepted child
socket results in the parent and the child being closed, such
that they no longer appear in sock_diag dumps.

[cherry-pick of net-next 2010b93e9317cc12acd20c4aed385af7f9d1681e]

Change-Id: I0555a142f11d8b36362ffd7c8ef4a5ecae8987c9
Reported-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: Lorenzo Colitti <lorenzo@google.com>
Acked-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-02-25 09:01:21 +09:00
Eric Dumazet
13e7d45b6a tcp: diag: add support for request sockets to tcp_abort()
Adding support for SYN_RECV request sockets to tcp_abort()
is quite easy after our tcp listener rewrite.

Note that we also need to better handle listeners, or we might
leak not yet accepted children, because of a missing
inet_csk_listen_stop() call.

[cherry-pick of net-next 07f6f4a31e5a8dee67960fc07bb0b37c5f879d4d]

Change-Id: I8ec6b2e6ec24f330a69595abf1d5469ace79b3fd
Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Lorenzo Colitti <lorenzo@google.com>
Tested-by: Lorenzo Colitti <lorenzo@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-02-25 09:01:21 +09:00
Lorenzo Colitti
79170d8d5d net: diag: Support destroying TCP sockets.
This implements SOCK_DESTROY for TCP sockets. It causes all
blocking calls on the socket to fail fast with ECONNABORTED and
causes a protocol close of the socket. It informs the other end
of the connection by sending a RST, i.e., initiating a TCP ABORT
as per RFC 793. ECONNABORTED was chosen for consistency with
FreeBSD.

[cherry-pick of net-next c1e64e298b8cad309091b95d8436a0255c84f54a]

Change-Id: I728a01ef03f2ccfb9016a3f3051ef00975980e49
Signed-off-by: Lorenzo Colitti <lorenzo@google.com>
Acked-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-02-25 09:01:21 +09:00
Lorenzo Colitti
c4c2b94c17 net: diag: Support SOCK_DESTROY for inet sockets.
This passes the SOCK_DESTROY operation to the underlying protocol
diag handler, or returns -EOPNOTSUPP if that handler does not
define a destroy operation.

Most of this patch is just renaming functions. This is not
strictly necessary, but it would be fairly counterintuitive to
have the code to destroy inet sockets be in a function whose name
starts with inet_diag_get.

[backport of net-next 6eb5d2e08f071c05ecbe135369c9ad418826cab2]

Change-Id: Idc13a7def20f492a5323ad2f8de105426293bd37
Signed-off-by: Lorenzo Colitti <lorenzo@google.com>
Acked-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-02-25 09:01:21 +09:00
Lorenzo Colitti
6b3d21c649 net: diag: split inet_diag_dump_one_icsk into two
Currently, inet_diag_dump_one_icsk finds a socket and then dumps
its information to userspace. Split it into a part that finds the
socket and a part that dumps the information.

[cherry-pick of net-next b613f56ec9baf30edf5d9d607b822532a273dad7]

Change-Id: I144765afb6ff1cd66eb4757c9418112fb0b08a6f
Signed-off-by: Lorenzo Colitti <lorenzo@google.com>
Acked-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-02-25 09:01:21 +09:00
Tushar Behera
3823c8136f tcp: Fix IPV6 module build errors
If CONFIG_IPV6=m is selected, we are getting following build errors.

net/built-in.o: In function `tcp_is_local6':
net/ipv4/tcp.c:3261: undefined reference to `rt6_lookup'

Making the code conditional upon only CONFIG_IPV6=y fixes this issue.

Also export tcp_nuke_addr to build IPv6 modules. Otherwise
we run into following build error:

  CC [M]  lib/zlib_deflate/deftree.o
  CC [M]  lib/zlib_deflate/deflate_syms.o
  LD [M]  lib/zlib_deflate/zlib_deflate.o
  Building modules, stage 2.
  MODPOST 46 modules
ERROR: "tcp_nuke_addr" [net/ipv6/ipv6.ko] undefined!
make[2]: *** [__modpost] Error 1

Signed-off-by: Tushar Behera <tushar.behera@linaro.org>
CC: John Stultz <john.stultz@linaro.org>
Signed-off-by: John Stultz <john.stultz@linaro.org>
Signed-off-by: Amit Pundir <amit.pundir@linaro.org>
2016-02-16 13:51:45 -08:00
Lorenzo Colitti
8bf4413b4f Don't kill IPv4 sockets when killing IPv6 sockets was requested.
c7c3ec4903d32c60423ee013d96e94602f66042c cherry-picked the
tcp_nuke_addr ioctl, but omitted a check that ensures that a
socket is an IPv6 socket. This makes it so that if we issue a
SIOCKILLADDR on ::, it kills IPv4 sockets as well.

This is because every IPv4 socket has an IPv6 source address
(sk_v6_rcv_saddr) of ::. Thus, when we iterate over an IPv4
socket, and compare the source address of the socket to the
source address in the ioctl, it matches the :: that was passed
in, and we kill the socket.

Change-Id: I736431a898e6ec91536536d352936a210aa10100
2016-02-16 13:51:44 -08:00
Dmitry Shmidt
dcc8ef4435 tcp: fix tcp_default_init_rwnd() for 4.1
Change-Id: If3ecf5f59acf379ffcc468f28434830a92b0383d
Signed-off-by: Dmitry Shmidt <dimitrysh@google.com>
2016-02-16 13:51:44 -08:00
JP Abgrall
72fc8653dc tcp: add a sysctl to config the tcp_default_init_rwnd
The default initial rwnd is hardcoded to 10.

Now we allow it to be controlled via
  /proc/sys/net/ipv4/tcp_default_init_rwnd
which limits the values from 3 to 100

This is somewhat needed because ipv6 routes are
autoconfigured by the kernel.

See "An Argument for Increasing TCP's Initial Congestion Window"
in https://developers.google.com/speed/articles/tcp_initcwnd_paper.pdf

Change-Id: I386b2a9d62de0ebe05c1ebe1b4bd91b314af5c54
Signed-off-by: JP Abgrall <jpa@google.com>

Conflicts:
	net/ipv4/sysctl_net_ipv4.c
	net/ipv4/tcp_input.c
2016-02-16 13:51:43 -08:00
Sreeram Ramachandran
ad49351038 net: core: Handle 'sk' being NULL in UID-based routing
It has Amit Pundir <amit.pundir@linaro.org> fix:
net: core: fix UID-based routing build

Bug: 15413527
Change-Id: Iab1fae9da6053b284591628ef1de878761b137b1
Signed-off-by: Sreeram Ramachandran <sreeram@google.com>
Signed-off-by: Dmitry Shmidt <dimitrysh@google.com>
Signed-off-by: Amit Pundir <amit.pundir@linaro.org>
2016-02-16 13:51:37 -08:00
Lorenzo Colitti
fd2cf795f3 net: core: Support UID-based routing.
This contains the following commits:

1. cc2f522 net: core: Add a UID range to fib rules.
2. d7ed2bd net: core: Use the socket UID in routing lookups.
3. 2f9306a net: core: Add a RTA_UID attribute to routes.
    This is so that userspace can do per-UID route lookups.
4. 8e46efb net: ipv6: Use the UID in IPv6 PMTUD
    IPv4 PMTUD already does this because ipv4_sk_update_pmtu
    uses __build_flow_key, which includes the UID.

Bug: 15413527
Change-Id: Iae3d4ca3979d252b6cec989bdc1a6875f811f03a
Signed-off-by: Lorenzo Colitti <lorenzo@google.com>
2016-02-16 13:51:37 -08:00
John Stultz
8a8ad1cf65 net: kuid/kguid build fixes
Small build fixes for xt_quota2 and ipv4 changes

Change-Id: Ib098768040c8875887b2081c3165a6c83b37e180
Signed-off-by: John Stultz <john.stultz@linaro.org>
2016-02-16 13:51:32 -08:00
Mike Chan
6b6d5fbf9a misc: uidstat: Adding uid stat driver to collect network statistics.
Signed-off-by: Mike Chan <mike@android.com>
2016-02-16 13:51:17 -08:00
Dmitry Torokhov
08f7c4280c net: fix crash in tcp_nuke_addr()
When iterating through sockets we need to skip sockets in TIME_WAIT
state as they use lightweight structure inet_timewait_sock that does not
have sk_lock member, and if we try to lock them we'll crash thusly:

[   89.376383] BUG: spinlock lockup suspected on CPU#0, netd/431
[   89.382139]  lock: 0xffffffc039d05070, .magic: 66d30606, .owner: /-1682098992, .owner_cpu: 0
[   89.390598] CPU: 0 PID: 431 Comm: netd Tainted: G     U  W 3.18.0 #5
[   89.397389] Hardware name: Google Tegra210 Smaug Rev 1+ (DT)
[   89.403049] Call trace:
[   89.405501] [<ffffffc0002072b4>] dump_backtrace+0x0/0x10c
[   89.410918] [<ffffffc0002073d0>] show_stack+0x10/0x1c
[   89.415971] [<ffffffc000a88608>] dump_stack+0x74/0x94
[   89.421018] [<ffffffc000257e8c>] spin_dump+0x78/0x88
[   89.425984] [<ffffffc0002580d8>] do_raw_spin_lock+0xfc/0x158
[   89.431666] [<ffffffc000a90090>] _raw_spin_lock+0x34/0x44
[   89.437059] [<ffffffc0009509a8>] tcp_nuke_addr+0x1fc/0x29c
[   89.442548] [<ffffffc0009735f4>] devinet_ioctl+0x288/0x680
[   89.448053] [<ffffffc000975004>] inet_ioctl+0xc4/0xf4
[   89.453103] [<ffffffc0008baedc>] sock_do_ioctl+0x2c/0x5c
[   89.458408] [<ffffffc0008bbb54>] sock_ioctl+0x210/0x230
[   89.463633] [<ffffffc000317088>] do_vfs_ioctl+0x4ac/0x590
[   89.469049] [<ffffffc0003171c8>] SyS_ioctl+0x5c/0x88

(or with NULL pointer dereference if lockdep is still working).

Change-Id: I07c70d9a60b125b1070ff05c4eec27daee1a3e90
Signed-off-by: Dmitry Torokhov <dtor@google.com>
2016-02-16 13:51:16 -08:00
Dmitry Torokhov
4747299b2c net: fix iterating over hashtable in tcp_nuke_addr()
The actual size of the tcp hashinfo table is tcp_hashinfo.ehash_mask + 1
so we need to adjust the loop accordingly to get the sockets hashed into
the last bucket.

Change-Id: I796b3c7b4a1a7fa35fba9e5192a4a403eb6e17de
Signed-off-by: Dmitry Torokhov <dtor@google.com>
2016-02-16 13:51:16 -08:00
Robert Love
38f0ec724f net: socket ioctl to reset connections matching local address
Introduce a new socket ioctl, SIOCKILLADDR, that nukes all sockets
bound to the same local address. This is useful in situations with
dynamic IPs, to kill stuck connections.

Signed-off-by: Brian Swetland <swetland@google.com>

net: fix tcp_v4_nuke_addr

Signed-off-by: Dima Zavin <dima@android.com>

net: ipv4: Fix a spinlock recursion bug in tcp_v4_nuke.

We can't hold the lock while calling to tcp_done(), so we drop
it before calling. We then have to start at the top of the chain again.

Signed-off-by: Dima Zavin <dima@android.com>

net: ipv4: Fix race in tcp_v4_nuke_addr().

To fix a recursive deadlock in 2.6.29, we stopped holding the hash table lock
across tcp_done() calls. This fixed the deadlock, but introduced a race where
the socket could die or change state.

Fix: Before unlocking the hash table, we grab a reference to the socket. We
can then unlock the hash table without risk of the socket going away. We then
lock the socket, which is safe because it is pinned. We can then call
tcp_done() without recursive deadlock and without race. Upon return, we unlock
the socket and then unpin it, killing it.

Change-Id: Idcdae072b48238b01bdbc8823b60310f1976e045
Signed-off-by: Robert Love <rlove@google.com>
Acked-by: Dima Zavin <dima@android.com>

ipv4: disable bottom halves around call to tcp_done().

Signed-off-by: Robert Love <rlove@google.com>
Signed-off-by: Colin Cross <ccross@android.com>

ipv4: Move sk_error_report inside bh_lock_sock in tcp_v4_nuke_addr

When sk_error_report is called, it wakes up the user-space thread, which then
calls tcp_close.  When the tcp_close is interrupted by the tcp_v4_nuke_addr
ioctl thread running tcp_done, it leaks 392 bytes and triggers a WARN_ON.

This patch moves the call to sk_error_report inside the bh_lock_sock, which
matches the locking used in tcp_v4_err.

Signed-off-by: Colin Cross <ccross@android.com>
2016-02-16 13:51:15 -08:00
Robert Love
6d4380a198 sysfs_net_ipv4: Add sysfs-based knobs for controlling TCP window size
Add a family of knobs to /sys/kernel/ipv4 for controlling the TCP window size:

	tcp_wmem_min
	tcp_wmem_def
	tcp_wmem_max
	tcp_rmem_min
	tcp_rmem_def
	tcp_rmem_max

This six values mirror the sysctl knobs in /proc/sys/net/ipv4/tcp_wmem and
/proc/sys/net/ipv4/tcp_rmem.

Sysfs, unlike sysctl, allows us to set and manage the files' permissions and
owners.

Signed-off-by: Robert Love <rlove@google.com>
2016-02-16 13:51:15 -08:00
Chia-chi Yeh
cc04396b84 net: Replace AID_NET_RAW checks with capable(CAP_NET_RAW).
Signed-off-by: Chia-chi Yeh <chiachi@android.com>
2016-02-16 13:51:14 -08:00
Robert Love
4b96f2c640 Paranoid network.
With CONFIG_ANDROID_PARANOID_NETWORK, require specific uids/gids to instantiate
network sockets.

Signed-off-by: Robert Love <rlove@google.com>

paranoid networking: Use in_egroup_p() to check group membership

The previous group_search() caused trouble for partners with module builds.
in_egroup_p() is also cleaner.

Signed-off-by: Nick Pelly <npelly@google.com>

Fix 2.6.29 build.

Signed-off-by: Arve Hjønnevåg <arve@android.com>

net: Fix compilation of the IPv6 module

Fix compilation of the IPv6 module -- current->euid does not exist anymore,
current_euid() is what needs to be used.

Signed-off-by: Steinar H. Gunderson <sesse@google.com>

net: bluetooth: Remove the AID_NET_BT* gid numbers

Removed bluetooth checks for AID_NET_BT and AID_NET_BT_ADMIN
which are not useful anymore.
This is in preparation for getting rid of all the AID_* gids.

Signed-off-by: JP Abgrall <jpa@google.com>
2016-02-16 13:51:13 -08:00
Konstantin Khlebnikov
1f0bdf6092 net: preserve IP control block during GSO segmentation
[ Upstream commit 9207f9d45b0ad071baa128e846d7e7ed85016df3 ]

Skb_gso_segment() uses skb control block during segmentation.
This patch adds 32-bytes room for previous control block which
will be copied into all resulting segments.

This patch fixes kernel crash during fragmenting forwarded packets.
Fragmentation requires valid IP CB in skb for clearing ip options.
Also patch removes custom save/restore in ovs code, now it's redundant.

Signed-off-by: Konstantin Khlebnikov <koct9i@gmail.com>
Link: http://lkml.kernel.org/r/CALYGNiP-0MZ-FExV2HutTvE9U-QQtkKSoE--KN=JQE5STYsjAA@mail.gmail.com
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2016-01-31 11:29:00 -08:00
Michal Kubeček
b856abbb45 udp: disallow UFO for sockets with SO_NO_CHECK option
[ Upstream commit 40ba330227ad00b8c0cdf2f425736ff9549cc423 ]

Commit acf8dd0a9d ("udp: only allow UFO for packets from SOCK_DGRAM
sockets") disallows UFO for packets sent from raw sockets. We need to do
the same also for SOCK_DGRAM sockets with SO_NO_CHECK options, even if
for a bit different reason: while such socket would override the
CHECKSUM_PARTIAL set by ip_ufo_append_data(), gso_size is still set and
bad offloading flags warning is triggered in __skb_gso_segment().

In the IPv6 case, SO_NO_CHECK option is ignored but we need to disallow
UFO for packets sent by sockets with UDP_NO_CHECK6_TX option.

Signed-off-by: Michal Kubecek <mkubecek@suse.cz>
Tested-by: Shannon Nelson <shannon.nelson@intel.com>
Acked-by: Hannes Frederic Sowa <hannes@stressinduktion.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2016-01-31 11:29:00 -08:00
Neal Cardwell
5f15ae009e tcp_yeah: don't set ssthresh below 2
[ Upstream commit 83d15e70c4d8909d722c0d64747d8fb42e38a48f ]

For tcp_yeah, use an ssthresh floor of 2, the same floor used by Reno
and CUBIC, per RFC 5681 (equation 4).

tcp_yeah_ssthresh() was sometimes returning a 0 or negative ssthresh
value if the intended reduction is as big or bigger than the current
cwnd. Congestion control modules should never return a zero or
negative ssthresh. A zero ssthresh generally results in a zero cwnd,
causing the connection to stall. A negative ssthresh value will be
interpreted as a u32 and will set a target cwnd for PRR near 4
billion.

Oleksandr Natalenko reported that a system using tcp_yeah with ECN
could see a warning about a prior_cwnd of 0 in
tcp_cwnd_reduction(). Testing verified that this was due to
tcp_yeah_ssthresh() misbehaving in this way.

Reported-by: Oleksandr Natalenko <oleksandr@natalenko.name>
Signed-off-by: Neal Cardwell <ncardwell@google.com>
Signed-off-by: Yuchung Cheng <ycheng@google.com>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2016-01-31 11:28:59 -08:00
Yuchung Cheng
8b8a321ff7 tcp: fix zero cwnd in tcp_cwnd_reduction
Patch 3759824da8 ("tcp: PRR uses CRB mode by default and SS mode
conditionally") introduced a bug that cwnd may become 0 when both
inflight and sndcnt are 0 (cwnd = inflight + sndcnt). This may lead
to a div-by-zero if the connection starts another cwnd reduction
phase by setting tp->prior_cwnd to the current cwnd (0) in
tcp_init_cwnd_reduction().

To prevent this we skip PRR operation when nothing is acked or
sacked. Then cwnd must be positive in all cases as long as ssthresh
is positive:

1) The proportional reduction mode
   inflight > ssthresh > 0

2) The reduction bound mode
  a) inflight == ssthresh > 0

  b) inflight < ssthresh
     sndcnt > 0 since newly_acked_sacked > 0 and inflight < ssthresh

Therefore in all cases inflight and sndcnt can not both be 0.
We check invalid tp->prior_cwnd to avoid potential div0 bugs.

In reality this bug is triggered only with a sequence of less common
events.  For example, the connection is terminating an ECN-triggered
cwnd reduction with an inflight 0, then it receives reordered/old
ACKs or DSACKs from prior transmission (which acks nothing). Or the
connection is in fast recovery stage that marks everything lost,
but fails to retransmit due to local issues, then receives data
packets from other end which acks nothing.

Fixes: 3759824da8 ("tcp: PRR uses CRB mode by default and SS mode conditionally")
Reported-by: Oleksandr Natalenko <oleksandr@natalenko.name>
Signed-off-by: Yuchung Cheng <ycheng@google.com>
Signed-off-by: Neal Cardwell <ncardwell@google.com>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-01-06 16:39:56 -05:00
David Ahern
b5bdacf3bb net: Propagate lookup failure in l3mdev_get_saddr to caller
Commands run in a vrf context are not failing as expected on a route lookup:
    root@kenny:~# ip ro ls table vrf-red
    unreachable default

    root@kenny:~# ping -I vrf-red -c1 -w1 10.100.1.254
    ping: Warning: source address might be selected on device other than vrf-red.
    PING 10.100.1.254 (10.100.1.254) from 0.0.0.0 vrf-red: 56(84) bytes of data.

    --- 10.100.1.254 ping statistics ---
    2 packets transmitted, 0 received, 100% packet loss, time 999ms

Since the vrf table does not have a route for 10.100.1.254 the ping
should have failed. The saddr lookup causes a full VRF table lookup.
Propogating a lookup failure to the user allows the command to fail as
expected:

    root@kenny:~# ping -I vrf-red -c1 -w1 10.100.1.254
    connect: No route to host

Signed-off-by: David Ahern <dsa@cumulusnetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-01-04 22:58:30 -05:00
David S. Miller
024f35c552 Merge branch 'master' of git://git.kernel.org/pub/scm/linux/kernel/git/klassert/ipsec
Steffen Klassert says:

====================
pull request (net): ipsec 2015-12-22

Just one patch to fix dst_entries_init with multiple namespaces.
From Dan Streetman.

Please pull or let me know if there are problems.
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
2015-12-22 16:26:31 -05:00
Pravin B Shelar
6d3c348a63 ipip: ioctl: Remove superfluous IP-TTL handling.
IP-TTL case is already handled in ip_tunnel_ioctl() API.

Signed-off-by: Pravin B Shelar <pshelar@nicira.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-12-18 16:07:59 -05:00
Eric Dumazet
07e100f984 tcp: restore fastopen with no data in SYN packet
Yuchung tracked a regression caused by commit 57be5bdad7 ("ip: convert
tcp_sendmsg() to iov_iter primitives") for TCP Fast Open.

Some Fast Open users do not actually add any data in the SYN packet.

Fixes: 57be5bdad7 ("ip: convert tcp_sendmsg() to iov_iter primitives")
Reported-by: Yuchung Cheng <ycheng@google.com>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Acked-by: Yuchung Cheng <ycheng@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-12-17 15:37:39 -05:00
Hannes Frederic Sowa
3036facbb7 fou: clean up socket with kfree_rcu
fou->udp_offloads is managed by RCU. As it is actually included inside
the fou sockets, we cannot let the memory go out of scope before a grace
period. We either can synchronize_rcu or switch over to kfree_rcu to
manage the sockets. kfree_rcu seems appropriate as it is used by vxlan
and geneve.

Fixes: 23461551c0 ("fou: Support for foo-over-udp RX path")
Cc: Tom Herbert <tom@herbertland.com>
Signed-off-by: Hannes Frederic Sowa <hannes@stressinduktion.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-12-16 19:03:02 -05:00
Eric Dumazet
5037e9ef94 net: fix IP early demux races
David Wilder reported crashes caused by dst reuse.

<quote David>
  I am seeing a crash on a distro V4.2.3 kernel caused by a double
  release of a dst_entry.  In ipv4_dst_destroy() the call to
  list_empty() finds a poisoned next pointer, indicating the dst_entry
  has already been removed from the list and freed. The crash occurs
  18 to 24 hours into a run of a network stress exerciser.
</quote>

Thanks to his detailed report and analysis, we were able to understand
the core issue.

IP early demux can associate a dst to skb, after a lookup in TCP/UDP
sockets.

When socket cache is not properly set, we want to store into
sk->sk_dst_cache the dst for future IP early demux lookups,
by acquiring a stable refcount on the dst.

Problem is this acquisition is simply using an atomic_inc(),
which works well, unless the dst was queued for destruction from
dst_release() noticing dst refcount went to zero, if DST_NOCACHE
was set on dst.

We need to make sure current refcount is not zero before incrementing
it, or risk double free as David reported.

This patch, being a stable candidate, adds two new helpers, and use
them only from IP early demux problematic paths.

It might be possible to merge in net-next skb_dst_force() and
skb_dst_force_safe(), but I prefer having the smallest patch for stable
kernels : Maybe some skb_dst_force() callers do not expect skb->dst
can suddenly be cleared.

Can probably be backported back to linux-3.6 kernels

Reported-by: David J. Wilder <dwilder@us.ibm.com>
Tested-by: David J. Wilder <dwilder@us.ibm.com>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-12-14 23:52:00 -05:00
Hannes Frederic Sowa
79462ad02e net: add validation for the socket syscall protocol argument
郭永刚 reported that one could simply crash the kernel as root by
using a simple program:

	int socket_fd;
	struct sockaddr_in addr;
	addr.sin_port = 0;
	addr.sin_addr.s_addr = INADDR_ANY;
	addr.sin_family = 10;

	socket_fd = socket(10,3,0x40000000);
	connect(socket_fd , &addr,16);

AF_INET, AF_INET6 sockets actually only support 8-bit protocol
identifiers. inet_sock's skc_protocol field thus is sized accordingly,
thus larger protocol identifiers simply cut off the higher bits and
store a zero in the protocol fields.

This could lead to e.g. NULL function pointer because as a result of
the cut off inet_num is zero and we call down to inet_autobind, which
is NULL for raw sockets.

kernel: Call Trace:
kernel:  [<ffffffff816db90e>] ? inet_autobind+0x2e/0x70
kernel:  [<ffffffff816db9a4>] inet_dgram_connect+0x54/0x80
kernel:  [<ffffffff81645069>] SYSC_connect+0xd9/0x110
kernel:  [<ffffffff810ac51b>] ? ptrace_notify+0x5b/0x80
kernel:  [<ffffffff810236d8>] ? syscall_trace_enter_phase2+0x108/0x200
kernel:  [<ffffffff81645e0e>] SyS_connect+0xe/0x10
kernel:  [<ffffffff81779515>] tracesys_phase2+0x84/0x89

I found no particular commit which introduced this problem.

CVE: CVE-2015-8543
Cc: Cong Wang <cwang@twopensource.com>
Reported-by: 郭永刚 <guoyonggang@360.cn>
Signed-off-by: Hannes Frederic Sowa <hannes@stressinduktion.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-12-14 16:09:30 -05:00
David S. Miller
9e5be5bd43 Merge git://git.kernel.org/pub/scm/linux/kernel/git/pablo/nf
Pablo Neira Ayuso says:

====================
netfilter fixes for net

The following patchset contains Netfilter fixes for you net tree,
specifically for nf_tables and nfnetlink_queue, they are:

1) Avoid a compilation warning in nfnetlink_queue that was introduced
   in the previous merge window with the simplification of the conntrack
   integration, from Arnd Bergmann.

2) nfnetlink_queue is leaking the pernet subsystem registration from
   a failure path, patch from Nikolay Borisov.

3) Pass down netns pointer to batch callback in nfnetlink, this is the
   largest patch and it is not a bugfix but it is a dependency to
   resolve a splat in the correct way.

4) Fix a splat due to incorrect socket memory accounting with nfnetlink
   skbuff clones.

5) Add missing conntrack dependencies to NFT_DUP_IPV4 and NFT_DUP_IPV6.

6) Traverse the nftables commit list in reverse order from the commit
   path, otherwise we crash when the user applies an incremental update
   via 'nft -f' that deletes an object that was just introduced in this
   batch, from Xin Long.

Regarding the compilation warning fix, many people have sent us (and
keep sending us) patches to address this, that's why I'm including this
batch even if this is not critical.
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
2015-12-14 11:09:01 -05:00
David Ahern
7f49e7a38b net: Flush local routes when device changes vrf association
The VRF driver cycles netdevs when an interface is enslaved or released:
the down event is used to flush neighbor and route tables and the up
event (if the interface was already up) effectively moves local and
connected routes to the proper table.

As of 4f823defdd the local route is left hanging around after a link
down, so when a netdev is moved from one VRF to another (or released
from a VRF altogether) local routes are left in the wrong table.

Fix by handling the NETDEV_CHANGEUPPER event. When the upper dev is
an L3mdev then call fib_disable_ip to flush all routes, local ones
to.

Fixes: 4f823defdd ("ipv4: fix to not remove local route on link down")
Cc: Julian Anastasov <ja@ssi.bg>
Signed-off-by: David Ahern <dsa@cumulusnetworks.com>
Signed-off-by: Nikolay Aleksandrov <nikolay@cumulusnetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-12-13 23:58:44 -05:00
Pablo Neira Ayuso
d3340b79ec netfilter: nf_dup: add missing dependencies with NF_CONNTRACK
CONFIG_NF_CONNTRACK=m
CONFIG_NF_DUP_IPV4=y

results in:

   net/built-in.o: In function `nf_dup_ipv4':
>> (.text+0xd434f): undefined reference to `nf_conntrack_untracked'

Reported-by: kbuild test robot <fengguang.wu@intel.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2015-12-10 18:17:06 +01:00
Andrew Lunn
4eba7bb1d7 ipv4: igmp: Allow removing groups from a removed interface
When a multicast group is joined on a socket, a struct ip_mc_socklist
is appended to the sockets mc_list containing information about the
joined group.

If the interface is hot unplugged, this entry becomes stale. Prior to
commit 52ad353a53 ("igmp: fix the problem when mc leave group") it
was possible to remove the stale entry by performing a
IP_DROP_MEMBERSHIP, passing either the old ifindex or ip address on
the interface. However, this fix enforces that the interface must
still exist. Thus with time, the number of stale entries grows, until
sysctl_igmp_max_memberships is reached and then it is not possible to
join and more groups.

The previous patch fixes an issue where a IP_DROP_MEMBERSHIP is
performed without specifying the interface, either by ifindex or ip
address. However here we do supply one of these. So loosen the
restriction on device existence to only apply when the interface has
not been specified. This then restores the ability to clean up the
stale entries.

Signed-off-by: Andrew Lunn <andrew@lunn.ch>
Fixes: 52ad353a53 "(igmp: fix the problem when mc leave group")
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-12-03 12:07:05 -05:00
Eric Dumazet
9cd3e072b0 net: rename SOCK_ASYNC_NOSPACE and SOCK_ASYNC_WAITDATA
This patch is a cleanup to make following patch easier to
review.

Goal is to move SOCK_ASYNC_NOSPACE and SOCK_ASYNC_WAITDATA
from (struct socket)->flags to a (struct socket_wq)->flags
to benefit from RCU protection in sock_wake_async()

To ease backports, we rename both constants.

Two new helpers, sk_set_bit(int nr, struct sock *sk)
and sk_clear_bit(int net, struct sock *sk) are added so that
following patch can change their implementation.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-12-01 15:45:05 -05:00
Eric Dumazet
142a2e7ece tcp: initialize tp->copied_seq in case of cross SYN connection
Dmitry provided a syzkaller (http://github.com/google/syzkaller)
generated program that triggers the WARNING at
net/ipv4/tcp.c:1729 in tcp_recvmsg() :

WARN_ON(tp->copied_seq != tp->rcv_nxt &&
        !(flags & (MSG_PEEK | MSG_TRUNC)));

His program is specifically attempting a Cross SYN TCP exchange,
that we support (for the pleasure of hackers ?), but it looks we
lack proper tcp->copied_seq initialization.

Thanks again Dmitry for your report and testings.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Reported-by: Dmitry Vyukov <dvyukov@google.com>
Tested-by: Dmitry Vyukov <dvyukov@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-11-30 15:34:17 -05:00
Nikolay Aleksandrov
fbdd29bfd2 net: ipmr, ip6mr: fix vif/tunnel failure race condition
Since (at least) commit b17a7c179d ("[NET]: Do sysfs registration as
part of register_netdevice."), netdev_run_todo() deals only with
unregistration, so we don't need to do the rtnl_unlock/lock cycle to
finish registration when failing pimreg or dvmrp device creation. In
fact that opens a race condition where someone can delete the device
while rtnl is unlocked because it's fully registered. The problem gets
worse when netlink support is introduced as there are more points of entry
that can cause it and it also makes reusing that code correctly impossible.

Signed-off-by: Nikolay Aleksandrov <nikolay@cumulusnetworks.com>
Reviewed-by: Cong Wang <cwang@twopensource.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-11-24 17:15:56 -05:00
Nikolay Aleksandrov
0e615e9601 net: ipmr: fix static mfc/dev leaks on table destruction
When destroying an mrt table the static mfc entries and the static
devices are kept, which leads to devices that can never be destroyed
(because of refcnt taken) and leaked memory, for example:
unreferenced object 0xffff880034c144c0 (size 192):
  comm "mfc-broken", pid 4777, jiffies 4320349055 (age 46001.964s)
  hex dump (first 32 bytes):
    98 53 f0 34 00 88 ff ff 98 53 f0 34 00 88 ff ff  .S.4.....S.4....
    ef 0a 0a 14 01 02 03 04 00 00 00 00 01 00 00 00  ................
  backtrace:
    [<ffffffff815c1b9e>] kmemleak_alloc+0x4e/0xb0
    [<ffffffff811ea6e0>] kmem_cache_alloc+0x190/0x300
    [<ffffffff815931cb>] ip_mroute_setsockopt+0x5cb/0x910
    [<ffffffff8153d575>] do_ip_setsockopt.isra.11+0x105/0xff0
    [<ffffffff8153e490>] ip_setsockopt+0x30/0xa0
    [<ffffffff81564e13>] raw_setsockopt+0x33/0x90
    [<ffffffff814d1e14>] sock_common_setsockopt+0x14/0x20
    [<ffffffff814d0b51>] SyS_setsockopt+0x71/0xc0
    [<ffffffff815cdbf6>] entry_SYSCALL_64_fastpath+0x16/0x7a
    [<ffffffffffffffff>] 0xffffffffffffffff

Make sure that everything is cleaned on netns destruction.

Signed-off-by: Nikolay Aleksandrov <nikolay@cumulusnetworks.com>
Reviewed-by: Cong Wang <cwang@twopensource.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-11-22 20:44:46 -05:00
Eric Dumazet
5d4c9bfbab tcp: fix potential huge kmalloc() calls in TCP_REPAIR
tcp_send_rcvq() is used for re-injecting data into tcp receive queue.

Problems :

- No check against size is performed, allowed user to fool kernel in
  attempting very large memory allocations, eventually triggering
  OOM when memory is fragmented.

- In case of fault during the copy we do not return correct errno.

Lets use alloc_skb_with_frags() to cook optimal skbs.

Fixes: 292e8d8c85 ("tcp: Move rcvq sending to tcp_input.c")
Fixes: c0e88ff0f2 ("tcp: Repair socket queues")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Pavel Emelyanov <xemul@parallels.com>
Acked-by: Pavel Emelyanov <xemul@parallels.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-11-20 10:57:33 -05:00
Yuchung Cheng
dd52bc2b4e tcp: fix Fast Open snmp over-counting bug
Fix incrementing TCPFastOpenActiveFailed snmp stats multiple times
when the handshake experiences multiple SYN timeouts.

Signed-off-by: Yuchung Cheng <ycheng@google.com>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-11-20 10:51:12 -05:00
Yuchung Cheng
0e45f4da59 tcp: disable Fast Open on timeouts after handshake
Some middle-boxes black-hole the data after the Fast Open handshake
(https://www.ietf.org/proceedings/94/slides/slides-94-tcpm-13.pdf).
The exact reason is unknown. The work-around is to disable Fast Open
temporarily after multiple recurring timeouts with few or no data
delivered in the established state.

Signed-off-by: Yuchung Cheng <ycheng@google.com>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Reported-by: Christoph Paasch <cpaasch@apple.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-11-20 10:51:12 -05:00
Eric Dumazet
1b8e6a01e1 tcp: md5: fix lockdep annotation
When a passive TCP is created, we eventually call tcp_md5_do_add()
with sk pointing to the child. It is not owner by the user yet (we
will add this socket into listener accept queue a bit later anyway)

But we do own the spinlock, so amend the lockdep annotation to avoid
following splat :

[ 8451.090932] net/ipv4/tcp_ipv4.c:923 suspicious rcu_dereference_protected() usage!
[ 8451.090932]
[ 8451.090932] other info that might help us debug this:
[ 8451.090932]
[ 8451.090934]
[ 8451.090934] rcu_scheduler_active = 1, debug_locks = 1
[ 8451.090936] 3 locks held by socket_sockopt_/214795:
[ 8451.090936]  #0:  (rcu_read_lock){.+.+..}, at: [<ffffffff855c6ac1>] __netif_receive_skb_core+0x151/0xe90
[ 8451.090947]  #1:  (rcu_read_lock){.+.+..}, at: [<ffffffff85618143>] ip_local_deliver_finish+0x43/0x2b0
[ 8451.090952]  #2:  (slock-AF_INET){+.-...}, at: [<ffffffff855acda5>] sk_clone_lock+0x1c5/0x500
[ 8451.090958]
[ 8451.090958] stack backtrace:
[ 8451.090960] CPU: 7 PID: 214795 Comm: socket_sockopt_

[ 8451.091215] Call Trace:
[ 8451.091216]  <IRQ>  [<ffffffff856fb29c>] dump_stack+0x55/0x76
[ 8451.091229]  [<ffffffff85123b5b>] lockdep_rcu_suspicious+0xeb/0x110
[ 8451.091235]  [<ffffffff8564544f>] tcp_md5_do_add+0x1bf/0x1e0
[ 8451.091239]  [<ffffffff85645751>] tcp_v4_syn_recv_sock+0x1f1/0x4c0
[ 8451.091242]  [<ffffffff85642b27>] ? tcp_v4_md5_hash_skb+0x167/0x190
[ 8451.091246]  [<ffffffff85647c78>] tcp_check_req+0x3c8/0x500
[ 8451.091249]  [<ffffffff856451ae>] ? tcp_v4_inbound_md5_hash+0x11e/0x190
[ 8451.091253]  [<ffffffff85647170>] tcp_v4_rcv+0x3c0/0x9f0
[ 8451.091256]  [<ffffffff85618143>] ? ip_local_deliver_finish+0x43/0x2b0
[ 8451.091260]  [<ffffffff856181b6>] ip_local_deliver_finish+0xb6/0x2b0
[ 8451.091263]  [<ffffffff85618143>] ? ip_local_deliver_finish+0x43/0x2b0
[ 8451.091267]  [<ffffffff85618d38>] ip_local_deliver+0x48/0x80
[ 8451.091270]  [<ffffffff85618510>] ip_rcv_finish+0x160/0x700
[ 8451.091273]  [<ffffffff8561900e>] ip_rcv+0x29e/0x3d0
[ 8451.091277]  [<ffffffff855c74b7>] __netif_receive_skb_core+0xb47/0xe90

Fixes: a8afca0329 ("tcp: md5: protects md5sig_info with RCU")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Reported-by: Willem de Bruijn <willemb@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-11-18 23:02:47 -05:00
stephen hemminger
945fae44d3 udp: remove duplicate include
Signed-off-by: Stephen Hemminger <stephen@networkplumber.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-11-18 14:58:02 -05:00