Commit graph

5891 commits

Author SHA1 Message Date
Wei-Chun Chao
7a7ffbabf9 ipv4: fix tunneled VM traffic over hw VXLAN/GRE GSO NIC
VM to VM GSO traffic is broken if it goes through VXLAN or GRE
tunnel and the physical NIC on the host supports hardware VXLAN/GRE
GSO offload (e.g. bnx2x and next-gen mlx4).

Two issues -
(VXLAN) VM traffic has SKB_GSO_DODGY and SKB_GSO_UDP_TUNNEL with
SKB_GSO_TCP/UDP set depending on the inner protocol. GSO header
integrity check fails in udp4_ufo_fragment if inner protocol is
TCP. Also gso_segs is calculated incorrectly using skb->len that
includes tunnel header. Fix: robust check should only be applied
to the inner packet.

(VXLAN & GRE) Once GSO header integrity check passes, NULL segs
is returned and the original skb is sent to hardware. However the
tunnel header is already pulled. Fix: tunnel header needs to be
restored so that hardware can perform GSO properly on the original
packet.

Signed-off-by: Wei-Chun Chao <weichunc@plumgrid.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2014-01-02 19:06:47 -05:00
Weilong Chen
dd9b45598a ipv4: switch and case should be at the same indent
Signed-off-by: Weilong Chen <chenweilong@huawei.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2014-01-02 03:30:36 -05:00
Weilong Chen
442c67f844 ipv4: spaces required around that '='
Signed-off-by: Weilong Chen <chenweilong@huawei.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2014-01-02 03:30:36 -05:00
Salam Noureddine
56022a8fdd ipv4: arp: update neighbour address when a gratuitous arp is received and arp_accept is set
Gratuitous arp packets are useful in switchover scenarios to update
client arp tables as quickly as possible. Currently, the mac address
of a neighbour is only updated after a locktime period has elapsed
since the last update. In most use cases such delays are unacceptable
for network admins. Moreover, the "updated" field of the neighbour
stucture doesn't record the last time the address of a neighbour
changed but records any change that happens to the neighbour. This is
clearly a bug since locktime uses that field as meaning "addr_updated".
With this observation, I was able to perpetuate a stale address by
sending a stream of gratuitous arp packets spaced less than locktime
apart. With this change the address is updated when a gratuitous arp
is received and the arp_accept sysctl is set.

Signed-off-by: Salam Noureddine <noureddine@aristanetworks.com>
Acked-by: Hannes Frederic Sowa <hannes@stressinduktion.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2014-01-02 00:08:38 -05:00
Pablo Neira Ayuso
d497c63527 netfilter: add help information to new nf_tables Kconfig options
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2014-01-01 18:37:10 +01:00
Eric Leblond
bee11dc78f netfilter: nft_reject: support for IPv6 and TCP reset
This patch moves nft_reject_ipv4 to nft_reject and adds support
for IPv6 protocol. This patch uses functions included in nf_reject.h
to implement reject by TCP reset.

The code has to be build as a module if NF_TABLES_IPV6 is also a
module to avoid compilation error due to usage of IPv6 functions.
This has been done in Kconfig by using the construct:

 depends on NF_TABLES_IPV6 || !NF_TABLES_IPV6

This seems a bit weird in terms of syntax but works perfectly.

Signed-off-by: Eric Leblond <eric@regit.org>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2013-12-30 18:15:38 +01:00
Eric Leblond
cc70d069e2 netfilter: REJECT: separate reusable code
This patch prepares the addition of TCP reset support in
the nft_reject module by moving reusable code into a header
file.

Signed-off-by: Eric Leblond <eric@regit.org>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2013-12-30 15:04:41 +01:00
stephen hemminger
24245a1b05 lro: remove dead code
Remove leftover code that is not used anywhere in current tree.

Signed-off-by: Stephen Hemminger <stephen@networkplumber.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2013-12-29 16:34:25 -05:00
stephen hemminger
f7e56a76ac tcp: make local functions static
The following are only used in one file:
  tcp_connect_init
  tcp_set_rto

Signed-off-by: Stephen Hemminger <stephen@networkplumber.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2013-12-29 16:34:24 -05:00
Stephen Hemminger
ea074b3495 ipv4: ping make local stuff static
Don't export ping_table or ping_v4_sendmsg. Both are only used
inside ping code.

Signed-off-by: Stephen Hemminger <stephen@networkplumber.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2013-12-28 17:05:45 -05:00
Stephen Hemminger
068a6e1834 ipv4: remove unused function
inetpeer_invalidate_family defined but never used

Signed-off-by: Stephen Hemminger <stephen@networkplumber.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2013-12-28 17:03:20 -05:00
Stephen Hemminger
7195cf7221 arp: make arp_invalidate static
Don't export arp_invalidate, only used in arp.c

Signed-off-by: Stephen Hemminger <stephen@networkplumber.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2013-12-28 17:02:46 -05:00
Stephen Hemminger
c9cb6b6ec1 ipv4: make fib_detect_death static
Make fib_detect_death function static only used in one file.

Signed-off-by: Stephen Hemminger <stephen@networkplumber.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2013-12-28 17:01:46 -05:00
Weilong Chen
49564e5516 ipv4: ipv4: Cleanup the comments in tcp_yeah.c
This cleanup the comments in tcp_yeah.c.
1.The old link is dead,use a new one to instead.
2.'lin' add nothing useful,remove it.
3.do not use C99 // comments.

Signed-off-by: Weilong Chen <chenweilong@huawei.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2013-12-26 13:43:55 -05:00
Weilong Chen
5797deb657 ipv4: ERROR: code indent should use tabs where possible
Signed-off-by: Weilong Chen <chenweilong@huawei.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2013-12-26 13:43:21 -05:00
Weilong Chen
47d18a9be1 ipv4: ERROR: do not initialise globals to 0 or NULL
Signed-off-by: Weilong Chen <chenweilong@huawei.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2013-12-26 13:43:21 -05:00
Weilong Chen
c71151f05b ipv4: fix all space errors in file igmp.c
Signed-off-by: Weilong Chen <chenweilong@huawei.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2013-12-26 13:43:21 -05:00
Weilong Chen
d41db5af26 ipv4: fix checkpatch error with foo * bar
Signed-off-by: Weilong Chen <chenweilong@huawei.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2013-12-26 13:43:21 -05:00
Weilong Chen
0c9a67d2ed ipv4: fix checkpatch error "space prohibited"
Signed-off-by: Weilong Chen <chenweilong@huawei.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2013-12-26 13:43:21 -05:00
Weilong Chen
a22318e83b ipv4: do clean up with spaces
Fix checkpatch errors like:
ERROR: spaces required around that XXX

Signed-off-by: Weilong Chen <chenweilong@huawei.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2013-12-26 13:43:21 -05:00
Hannes Frederic Sowa
61e7f09d0f ipv4: consistent reporting of pmtu data in case of corking
We report different pmtu values back on the first write and on further
writes on an corked socket.

Also don't include the dst.header_len (respectively exthdrlen) as this
should already be dealt with by the interface mtu of the outgoing
(virtual) interface and policy of that interface should dictate if
fragmentation should happen.

Instead reduce the pmtu data by IP options as we do for IPv6. Make the
same changes for ip_append_data, where we did not care about options or
dst.header_len at all.

Signed-off-by: Hannes Frederic Sowa <hannes@stressinduktion.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2013-12-22 18:52:09 -05:00
stephen hemminger
09aea5df7f netconf: rename PROXY_ARP to NEIGH_PROXY
Use same field for both IPv4 (proxy_arp) and IPv6 (proxy_ndp)
so fix it before API is set to be a common name

Signed-off-by: Stephen Hemminger <stephen@networkplumber.org>
Acked-by: Nicolas Dichtel <nicolas.dichtel@6wind.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2013-12-22 18:02:43 -05:00
Eric Dumazet
a181ceb501 tcp: autocork should not hold first packet in write queue
Willem noticed a TCP_RR regression caused by TCP autocorking
on a Mellanox test bed. MLX4_EN_TX_COAL_TIME is 16 us, which can be
right above RTT between hosts.

We can receive a ACK for a packet still in NIC TX ring buffer or in a
softnet completion queue.

Fix this by always pushing the skb if it is at the head of write queue.

Also, as TX completion is lockless, it's safer to perform sk_wmem_alloc
test after setting TSQ_THROTTLED.

erd:~# MIB="MIN_LATENCY,MEAN_LATENCY,MAX_LATENCY,P99_LATENCY,STDDEV_LATENCY"
erd:~#  ./netperf -H remote -t TCP_RR -- -o $MIB | tail -n 1
(repeat 3 times)

Before patch :

18,1049.87,41004,39631,6295.47
17,239.52,40804,48,2912.79
18,348.40,40877,54,3573.39

After patch :

18,22.84,4606,38,16.39
17,21.56,2871,36,13.51
17,22.46,2705,37,11.83

Reported-by: Willem de Bruijn <willemb@google.com>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Fixes: f54b311142 ("tcp: auto corking")
Signed-off-by: David S. Miller <davem@davemloft.net>
2013-12-20 17:56:25 -05:00
David S. Miller
1669cb9855 Merge branch 'master' of git://git.kernel.org/pub/scm/linux/kernel/git/klassert/ipsec-next
Steffen Klassert says:

====================
pull request (net-next): ipsec-next 2013-12-19

1) Use the user supplied policy index instead of a generated one
   if present. From Fan Du.

2) Make xfrm migration namespace aware. From Fan Du.

3) Make the xfrm state and policy locks namespace aware. From Fan Du.

4) Remove ancient sleeping when the SA is in acquire state,
   we now queue packets to the policy instead. This replaces the
   sleeping code.

5) Remove FLOWI_FLAG_CAN_SLEEP. This was used to notify xfrm about the
   posibility to sleep. The sleeping code is gone, so remove it.

6) Check user specified spi for IPComp. Thr spi for IPcomp is only
   16 bit wide, so check for a valid value. From Fan Du.

7) Export verify_userspi_info to check for valid user supplied spi ranges
   with pfkey and netlink. From Fan Du.

8) RFC3173 states that if the total size of a compressed payload and the IPComp
   header is not smaller than the size of the original payload, the IP datagram
   must be sent in the original non-compressed form. These packets are dropped
   by the inbound policy check because they are not transformed. Document the need
   to set 'level use' for IPcomp to receive such packets anyway. From Fan Du.

Please pull or let me know if there are problems.
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
2013-12-19 18:37:49 -05:00
Daniel Borkmann
b1aac815c0 net: inet_diag: zero out uninitialized idiag_{src,dst} fields
Jakub reported while working with nlmon netlink sniffer that parts of
the inet_diag_sockid are not initialized when r->idiag_family != AF_INET6.
That is, fields of r->id.idiag_src[1 ... 3], r->id.idiag_dst[1 ... 3].

In fact, it seems that we can leak 6 * sizeof(u32) byte of kernel [slab]
memory through this. At least, in udp_dump_one(), we allocate a skb in ...

  rep = nlmsg_new(sizeof(struct inet_diag_msg) + ..., GFP_KERNEL);

... and then pass that to inet_sk_diag_fill() that puts the whole struct
inet_diag_msg into the skb, where we only fill out r->id.idiag_src[0],
r->id.idiag_dst[0] and leave the rest untouched:

  r->id.idiag_src[0] = inet->inet_rcv_saddr;
  r->id.idiag_dst[0] = inet->inet_daddr;

struct inet_diag_msg embeds struct inet_diag_sockid that is correctly /
fully filled out in IPv6 case, but for IPv4 not.

So just zero them out by using plain memset (for this little amount of
bytes it's probably not worth the extra check for idiag_family == AF_INET).

Similarly, fix also other places where we fill that out.

Reported-by: Jakub Zawadzki <darkjames-ws@darkjames.pl>
Signed-off-by: Daniel Borkmann <dborkman@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2013-12-19 14:55:52 -05:00
Timo Teräs
0e3da5bb8d ip_gre: fix msg_name parsing for recvfrom/recvmsg
ipgre_header_parse() needs to parse the tunnel's ip header and it
uses mac_header to locate the iphdr. This got broken when gre tunneling
was refactored as mac_header is no longer updated to point to iphdr.
Introduce skb_pop_mac_header() helper to do the mac_header assignment
and use it in ipgre_rcv() to fix msg_name parsing.

Bug introduced in commit c544193214 (GRE: Refactor GRE tunneling code.)

Cc: Pravin B Shelar <pshelar@nicira.com>
Signed-off-by: Timo Teräs <timo.teras@iki.fi>
Signed-off-by: David S. Miller <davem@davemloft.net>
2013-12-18 17:44:33 -05:00
Hannes Frederic Sowa
cd174e67a6 ipv4: new ip_no_pmtu_disc mode to always discard incoming frag needed msgs
This new mode discards all incoming fragmentation-needed notifications
as I guess was originally intended with this knob. To not break backward
compatibility too much, I only added a special case for mode 2 in the
receiving path.

Signed-off-by: Hannes Frederic Sowa <hannes@stressinduktion.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2013-12-18 16:58:20 -05:00
Hannes Frederic Sowa
974eda11c5 inet: make no_pmtu_disc per namespace and kill ipv4_config
The other field in ipv4_config, log_martians, was converted to a
per-interface setting, so we can just remove the whole structure.

Signed-off-by: Hannes Frederic Sowa <hannes@stressinduktion.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2013-12-18 16:58:20 -05:00
David S. Miller
143c905494 Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net
Conflicts:
	drivers/net/ethernet/intel/i40e/i40e_main.c
	drivers/net/macvtap.c

Both minor merge hassles, simple overlapping changes.

Signed-off-by: David S. Miller <davem@davemloft.net>
2013-12-18 16:42:06 -05:00
Tom Herbert
7539fadcb8 net: Add utility functions to clear rxhash
In several places 'skb->rxhash = 0' is being done to clear the
rxhash value in an skb.  This does not clear l4_rxhash which could
still be set so that the rxhash wouldn't be recalculated on subsequent
call to skb_get_rxhash.  This patch adds an explict function to clear
all the rxhash related information in the skb properly.

skb_clear_hash_if_not_l4 clears the rxhash only if it is not marked as
l4_rxhash.

Fixed up places where 'skb->rxhash = 0' was being called.

Signed-off-by: Tom Herbert <therbert@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2013-12-17 16:36:21 -05:00
Eric Dumazet
d4589926d7 tcp: refine TSO splits
While investigating performance problems on small RPC workloads,
I noticed linux TCP stack was always splitting the last TSO skb
into two parts (skbs). One being a multiple of MSS, and a small one
with the Push flag. This split is done even if TCP_NODELAY is set,
or if no small packet is in flight.

Example with request/response of 4K/4K

IP A > B: . ack 68432 win 2783 <nop,nop,timestamp 6524593 6525001>
IP A > B: . 65537:68433(2896) ack 69632 win 2783 <nop,nop,timestamp 6524593 6525001>
IP A > B: P 68433:69633(1200) ack 69632 win 2783 <nop,nop,timestamp 6524593 6525001>
IP B > A: . ack 68433 win 2768 <nop,nop,timestamp 6525001 6524593>
IP B > A: . 69632:72528(2896) ack 69633 win 2768 <nop,nop,timestamp 6525001 6524593>
IP B > A: P 72528:73728(1200) ack 69633 win 2768 <nop,nop,timestamp 6525001 6524593>
IP A > B: . ack 72528 win 2783 <nop,nop,timestamp 6524593 6525001>
IP A > B: . 69633:72529(2896) ack 73728 win 2783 <nop,nop,timestamp 6524593 6525001>
IP A > B: P 72529:73729(1200) ack 73728 win 2783 <nop,nop,timestamp 6524593 6525001>

We can avoid this split by including the Nagle tests at the right place.

Note : If some NIC had trouble sending TSO packets with a partial
last segment, we would have hit the problem in GRO/forwarding workload already.

tcp_minshall_update() is moved to tcp_output.c and is updated as we might
feed a TSO packet with a partial last segment.

This patch tremendously improves performance, as the traffic now looks
like :

IP A > B: . ack 98304 win 2783 <nop,nop,timestamp 6834277 6834685>
IP A > B: P 94209:98305(4096) ack 98304 win 2783 <nop,nop,timestamp 6834277 6834685>
IP B > A: . ack 98305 win 2768 <nop,nop,timestamp 6834686 6834277>
IP B > A: P 98304:102400(4096) ack 98305 win 2768 <nop,nop,timestamp 6834686 6834277>
IP A > B: . ack 102400 win 2783 <nop,nop,timestamp 6834279 6834686>
IP A > B: P 98305:102401(4096) ack 102400 win 2783 <nop,nop,timestamp 6834279 6834686>
IP B > A: . ack 102401 win 2768 <nop,nop,timestamp 6834687 6834279>
IP B > A: P 102400:106496(4096) ack 102401 win 2768 <nop,nop,timestamp 6834687 6834279>
IP A > B: . ack 106496 win 2783 <nop,nop,timestamp 6834280 6834687>
IP A > B: P 102401:106497(4096) ack 106496 win 2783 <nop,nop,timestamp 6834280 6834687>
IP B > A: . ack 106497 win 2768 <nop,nop,timestamp 6834688 6834280>
IP B > A: P 106496:110592(4096) ack 106497 win 2768 <nop,nop,timestamp 6834688 6834280>

Before :

lpq83:~# nstat >/dev/null;perf stat ./super_netperf 200 -t TCP_RR -H lpq84 -l 20 -- -r 4K,4K
280774

 Performance counter stats for './super_netperf 200 -t TCP_RR -H lpq84 -l 20 -- -r 4K,4K':

     205719.049006 task-clock                #    9.278 CPUs utilized
         8,449,968 context-switches          #    0.041 M/sec
         1,935,997 CPU-migrations            #    0.009 M/sec
           160,541 page-faults               #    0.780 K/sec
   548,478,722,290 cycles                    #    2.666 GHz                     [83.20%]
   455,240,670,857 stalled-cycles-frontend   #   83.00% frontend cycles idle    [83.48%]
   272,881,454,275 stalled-cycles-backend    #   49.75% backend  cycles idle    [66.73%]
   166,091,460,030 instructions              #    0.30  insns per cycle
                                             #    2.74  stalled cycles per insn [83.39%]
    29,150,229,399 branches                  #  141.699 M/sec                   [83.30%]
     1,943,814,026 branch-misses             #    6.67% of all branches         [83.32%]

      22.173517844 seconds time elapsed

lpq83:~# nstat | egrep "IpOutRequests|IpExtOutOctets"
IpOutRequests                   16851063           0.0
IpExtOutOctets                  23878580777        0.0

After patch :

lpq83:~# nstat >/dev/null;perf stat ./super_netperf 200 -t TCP_RR -H lpq84 -l 20 -- -r 4K,4K
280877

 Performance counter stats for './super_netperf 200 -t TCP_RR -H lpq84 -l 20 -- -r 4K,4K':

     107496.071918 task-clock                #    4.847 CPUs utilized
         5,635,458 context-switches          #    0.052 M/sec
         1,374,707 CPU-migrations            #    0.013 M/sec
           160,920 page-faults               #    0.001 M/sec
   281,500,010,924 cycles                    #    2.619 GHz                     [83.28%]
   228,865,069,307 stalled-cycles-frontend   #   81.30% frontend cycles idle    [83.38%]
   142,462,742,658 stalled-cycles-backend    #   50.61% backend  cycles idle    [66.81%]
    95,227,712,566 instructions              #    0.34  insns per cycle
                                             #    2.40  stalled cycles per insn [83.43%]
    16,209,868,171 branches                  #  150.795 M/sec                   [83.20%]
       874,252,952 branch-misses             #    5.39% of all branches         [83.37%]

      22.175821286 seconds time elapsed

lpq83:~# nstat | egrep "IpOutRequests|IpExtOutOctets"
IpOutRequests                   11239428           0.0
IpExtOutOctets                  23595191035        0.0

Indeed, the occupancy of tx skbs (IpExtOutOctets/IpOutRequests) is higher :
2099 instead of 1417, thus helping GRO to be more efficient when using FQ packet
scheduler.

Many thanks to Neal for review and ideas.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Yuchung Cheng <ycheng@google.com>
Cc: Neal Cardwell <ncardwell@google.com>
Cc: Nandita Dukkipati <nanditad@google.com>
Cc: Van Jacobson <vanj@google.com>
Acked-by: Neal Cardwell <ncardwell@google.com>
Tested-by: Neal Cardwell <ncardwell@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2013-12-17 15:15:25 -05:00
David S. Miller
7089fdd814 Merge branch 'master' of git://git.kernel.org/pub/scm/linux/kernel/git/pablo/nf
Pablo Neira Ayuso says:

====================
The following patchset contains two Netfilter fixes for your net
tree, they are:

* Fix endianness in nft_reject, the NFTA_REJECT_TYPE netlink attributes
  was not converted to network byte order as needed by all nfnetlink
  subsystems, from Eric Leblond.

* Restrict SYNPROXY target to INPUT and FORWARD chains, this avoid a
  possible crash due to misconfigurations, from Patrick McHardy.
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
2013-12-17 15:06:20 -05:00
Eric Dumazet
e47eb5dfb2 udp: ipv4: do not use sk_dst_lock from softirq context
Using sk_dst_lock from softirq context is not supported right now.

Instead of adding BH protection everywhere,
udp_sk_rx_dst_set() can instead use xchg(), as suggested
by David.

Reported-by: Fengguang Wu <fengguang.wu@intel.com>
Fixes: 9750223102 ("udp: ipv4: must add synchronization in udp_sk_rx_dst_set()")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2013-12-17 14:50:58 -05:00
stephen hemminger
f085ff1c13 netconf: add proxy-arp support
Add support to netconf to show changes to proxy-arp status on a per
interface basis via netlink in a manner similar to forwarding
and reverse path state.

Signed-off-by: Stephen Hemminger <stephen@networkplumber.org>
Acked-by: Nicolas Dichtel <nicolas.dichtel@6wind.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2013-12-14 00:58:22 -05:00
Jerry Chu
299603e837 net-gro: Prepare GRO stack for the upcoming tunneling support
This patch modifies the GRO stack to avoid the use of "network_header"
and associated macros like ip_hdr() and ipv6_hdr() in order to allow
an arbitary number of IP hdrs (v4 or v6) to be used in the
encapsulation chain. This lays the foundation for various IP
tunneling support (IP-in-IP, GRE, VXLAN, SIT,...) to be added later.

With this patch, the GRO stack traversing now is mostly based on
skb_gro_offset rather than special hdr offsets saved in skb (e.g.,
skb->network_header). As a result all but the top layer (i.e., the
the transport layer) must have hdrs of the same length in order for
a pkt to be considered for aggregation. Therefore when adding a new
encap layer (e.g., for tunneling), one must check and skip flows
(e.g., by setting NAPI_GRO_CB(p)->same_flow to 0) that have a
different hdr length.

Note that unlike the network header, the transport header can and
will continue to be set by the GRO code since there will be at
most one "transport layer" in the encap chain.

Signed-off-by: H.K. Jerry Chu <hkchu@google.com>
Suggested-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2013-12-12 13:47:53 -05:00
Eric Leblond
a3adadf301 netfilter: nft_reject: fix endianness in dump function
The dump function in nft_reject_ipv4 was not converting a u32
field to network order before sending it to userspace, this
needs to happen for consistency with other nf_tables and
nfnetlink subsystems.

Signed-off-by: Eric Leblond <eric@regit.org>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2013-12-12 09:37:39 +01:00
Eric Dumazet
9750223102 udp: ipv4: must add synchronization in udp_sk_rx_dst_set()
Unlike TCP, UDP input path does not hold the socket lock.

Before messing with sk->sk_rx_dst, we must use a spinlock, otherwise
multiple cpus could leak a refcount.

This patch also takes care of renewing a stale dst entry.
(When the sk->sk_rx_dst would not be used by IP early demux)

Fixes: 421b3885bf ("udp: ipv4: Add udp early demux")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Shawn Bohrer <sbohrer@rgmadvisors.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2013-12-11 20:21:10 -05:00
Eric Dumazet
610438b744 udp: ipv4: fix potential use after free in udp_v4_early_demux()
pskb_may_pull() can reallocate skb->head, we need to move the
initialization of iph and uh pointers after its call.

Fixes: 421b3885bf ("udp: ipv4: Add udp early demux")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Shawn Bohrer <sbohrer@rgmadvisors.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2013-12-11 16:10:14 -05:00
Nicolas Dichtel
b601fa197f ipv4: fix wildcard search with inet_confirm_addr()
Help of this function says: "in_dev: only on this interface, 0=any interface",
but since commit 39a6d06300 ("[NETNS]: Process inet_confirm_addr in the
correct namespace."), the code supposes that it will never be NULL. This
function is never called with in_dev == NULL, but it's exported and may be used
by an external module.

Because this patch restore the ability to call inet_confirm_addr() with in_dev
== NULL, I partially revert the above commit, as suggested by Julian.

CC: Julian Anastasov <ja@ssi.bg>
Signed-off-by: Nicolas Dichtel <nicolas.dichtel@6wind.com>
Reviewed-by: Julian Anastasov <ja@ssi.bg>
Signed-off-by: David S. Miller <davem@davemloft.net>
2013-12-11 14:47:40 -05:00
Patrick McHardy
f01b3926ee netfilter: SYNPROXY target: restrict to INPUT/FORWARD
Fix a crash in synproxy_send_tcp() when using the SYNPROXY target in the
PREROUTING chain caused by missing routing information.

Reported-by: Nicki P. <xastx@gmx.de>
Signed-off-by: Patrick McHardy <kaber@trash.net>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2013-12-11 11:30:25 +01:00
Eric Dumazet
8afdd99a13 udp: ipv4: fix an use after free in __udp4_lib_rcv()
Dave Jones reported a use after free in UDP stack :

[ 5059.434216] =========================
[ 5059.434314] [ BUG: held lock freed! ]
[ 5059.434420] 3.13.0-rc3+ #9 Not tainted
[ 5059.434520] -------------------------
[ 5059.434620] named/863 is freeing memory ffff88005e960000-ffff88005e96061f, with a lock still held there!
[ 5059.434815]  (slock-AF_INET){+.-...}, at: [<ffffffff8149bd21>] udp_queue_rcv_skb+0xd1/0x4b0
[ 5059.435012] 3 locks held by named/863:
[ 5059.435086]  #0:  (rcu_read_lock){.+.+..}, at: [<ffffffff8143054d>] __netif_receive_skb_core+0x11d/0x940
[ 5059.435295]  #1:  (rcu_read_lock){.+.+..}, at: [<ffffffff81467a5e>] ip_local_deliver_finish+0x3e/0x410
[ 5059.435500]  #2:  (slock-AF_INET){+.-...}, at: [<ffffffff8149bd21>] udp_queue_rcv_skb+0xd1/0x4b0
[ 5059.435734]
stack backtrace:
[ 5059.435858] CPU: 0 PID: 863 Comm: named Not tainted 3.13.0-rc3+ #9 [loadavg: 0.21 0.06 0.06 1/115 1365]
[ 5059.436052] Hardware name:                  /D510MO, BIOS MOPNV10J.86A.0175.2010.0308.0620 03/08/2010
[ 5059.436223]  0000000000000002 ffff88007e203ad8 ffffffff8153a372 ffff8800677130e0
[ 5059.436390]  ffff88007e203b10 ffffffff8108cafa ffff88005e960000 ffff88007b00cfc0
[ 5059.436554]  ffffea00017a5800 ffffffff8141c490 0000000000000246 ffff88007e203b48
[ 5059.436718] Call Trace:
[ 5059.436769]  <IRQ>  [<ffffffff8153a372>] dump_stack+0x4d/0x66
[ 5059.436904]  [<ffffffff8108cafa>] debug_check_no_locks_freed+0x15a/0x160
[ 5059.437037]  [<ffffffff8141c490>] ? __sk_free+0x110/0x230
[ 5059.437147]  [<ffffffff8112da2a>] kmem_cache_free+0x6a/0x150
[ 5059.437260]  [<ffffffff8141c490>] __sk_free+0x110/0x230
[ 5059.437364]  [<ffffffff8141c5c9>] sk_free+0x19/0x20
[ 5059.437463]  [<ffffffff8141cb25>] sock_edemux+0x25/0x40
[ 5059.437567]  [<ffffffff8141c181>] sock_queue_rcv_skb+0x81/0x280
[ 5059.437685]  [<ffffffff8149bd21>] ? udp_queue_rcv_skb+0xd1/0x4b0
[ 5059.437805]  [<ffffffff81499c82>] __udp_queue_rcv_skb+0x42/0x240
[ 5059.437925]  [<ffffffff81541d25>] ? _raw_spin_lock+0x65/0x70
[ 5059.438038]  [<ffffffff8149bebb>] udp_queue_rcv_skb+0x26b/0x4b0
[ 5059.438155]  [<ffffffff8149c712>] __udp4_lib_rcv+0x152/0xb00
[ 5059.438269]  [<ffffffff8149d7f5>] udp_rcv+0x15/0x20
[ 5059.438367]  [<ffffffff81467b2f>] ip_local_deliver_finish+0x10f/0x410
[ 5059.438492]  [<ffffffff81467a5e>] ? ip_local_deliver_finish+0x3e/0x410
[ 5059.438621]  [<ffffffff81468653>] ip_local_deliver+0x43/0x80
[ 5059.438733]  [<ffffffff81467f70>] ip_rcv_finish+0x140/0x5a0
[ 5059.438843]  [<ffffffff81468926>] ip_rcv+0x296/0x3f0
[ 5059.438945]  [<ffffffff81430b72>] __netif_receive_skb_core+0x742/0x940
[ 5059.439074]  [<ffffffff8143054d>] ? __netif_receive_skb_core+0x11d/0x940
[ 5059.442231]  [<ffffffff8108c81d>] ? trace_hardirqs_on+0xd/0x10
[ 5059.442231]  [<ffffffff81430d83>] __netif_receive_skb+0x13/0x60
[ 5059.442231]  [<ffffffff81431c1e>] netif_receive_skb+0x1e/0x1f0
[ 5059.442231]  [<ffffffff814334e0>] napi_gro_receive+0x70/0xa0
[ 5059.442231]  [<ffffffffa01de426>] rtl8169_poll+0x166/0x700 [r8169]
[ 5059.442231]  [<ffffffff81432bc9>] net_rx_action+0x129/0x1e0
[ 5059.442231]  [<ffffffff810478cd>] __do_softirq+0xed/0x240
[ 5059.442231]  [<ffffffff81047e25>] irq_exit+0x125/0x140
[ 5059.442231]  [<ffffffff81004241>] do_IRQ+0x51/0xc0
[ 5059.442231]  [<ffffffff81542bef>] common_interrupt+0x6f/0x6f

We need to keep a reference on the socket, by using skb_steal_sock()
at the right place.

Note that another patch is needed to fix a race in
udp_sk_rx_dst_set(), as we hold no lock protecting the dst.

Fixes: 421b3885bf ("udp: ipv4: Add udp early demux")
Reported-by: Dave Jones <davej@redhat.com>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Shawn Bohrer <sbohrer@rgmadvisors.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2013-12-10 22:58:40 -05:00
stephen hemminger
8e3bff96af net: more spelling fixes
Various spelling fixes in networking stack

Signed-off-by: Stephen Hemminger <stephen@networkplumber.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2013-12-10 21:57:11 -05:00
Jiri Pirko
ad6c81359f ipv4: add support for IFA_FLAGS nl attribute
Signed-off-by: Jiri Pirko <jiri@resnulli.us>
Signed-off-by: David S. Miller <davem@davemloft.net>
2013-12-10 21:50:00 -05:00
Stefan Tomanek
673498b8ed inet: fix NULL pointer Oops in fib(6)_rule_suppress
This changes ensures that the routing entry investigated by the suppress
function actually does point to a device struct before following that pointer,
fixing a possible kernel oops situation when verifying the interface group
associated with a routing table entry.

According to Daniel Golle, this Oops can be triggered by a user process trying
to establish an outgoing IPv6 connection while having no real IPv6 connectivity
set up (only autoassigned link-local addresses).

Fixes: 6ef94cfafb ("fib_rules: add route suppression based on ifgroup")

Reported-by: Daniel Golle <daniel.golle@gmail.com>
Tested-by: Daniel Golle <daniel.golle@gmail.com>
Signed-off-by: Stefan Tomanek <stefan.tomanek@wertarbyte.de>
Signed-off-by: David S. Miller <davem@davemloft.net>
2013-12-10 17:54:23 -05:00
Jiri Pirko
1d4c8c2984 neigh: restore old behaviour of default parms values
Previously inet devices were only constructed when addresses are added.
Therefore the default neigh parms values they get are the ones at the
time of these operations.

Now that we're creating inet devices earlier, this changes the behaviour
of default neigh parms values in an incompatible way (see bug #8519).

This patch creates a compromise by setting the default values at the
same point as before but only for those that have not been explicitly
set by the user since the inet device's creation.

Introduced by:
commit 8030f54499
Author: Herbert Xu <herbert@gondor.apana.org.au>
Date:   Thu Feb 22 01:53:47 2007 +0900

    [IPV4] devinet: Register inetdev earlier.

Signed-off-by: Jiri Pirko <jiri@resnulli.us>
Signed-off-by: David S. Miller <davem@davemloft.net>
2013-12-09 20:56:12 -05:00
Jiri Pirko
73af614aed neigh: use tbl->family to distinguish ipv4 from ipv6
Signed-off-by: Jiri Pirko <jiri@resnulli.us>
Signed-off-by: David S. Miller <davem@davemloft.net>
2013-12-09 20:56:12 -05:00
Jiri Pirko
1f9248e560 neigh: convert parms to an array
This patch converts the neigh param members to an array. This allows easier
manipulation which will be needed later on to provide better management of
default values.

Signed-off-by: Jiri Pirko <jiri@resnulli.us>
Signed-off-by: David S. Miller <davem@davemloft.net>
2013-12-09 20:56:12 -05:00
David S. Miller
34f9f43710 Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net
Merge 'net' into 'net-next' to get the AF_PACKET bug fix that
Daniel's direct transmit changes depend upon.

Signed-off-by: David S. Miller <davem@davemloft.net>
2013-12-09 20:20:14 -05:00
Eric Dumazet
f54b311142 tcp: auto corking
With the introduction of TCP Small Queues, TSO auto sizing, and TCP
pacing, we can implement Automatic Corking in the kernel, to help
applications doing small write()/sendmsg() to TCP sockets.

Idea is to change tcp_push() to check if the current skb payload is
under skb optimal size (a multiple of MSS bytes)

If under 'size_goal', and at least one packet is still in Qdisc or
NIC TX queues, set the TCP Small Queue Throttled bit, so that the push
will be delayed up to TX completion time.

This delay might allow the application to coalesce more bytes
in the skb in following write()/sendmsg()/sendfile() system calls.

The exact duration of the delay is depending on the dynamics
of the system, and might be zero if no packet for this flow
is actually held in Qdisc or NIC TX ring.

Using FQ/pacing is a way to increase the probability of
autocorking being triggered.

Add a new sysctl (/proc/sys/net/ipv4/tcp_autocorking) to control
this feature and default it to 1 (enabled)

Add a new SNMP counter : nstat -a | grep TcpExtTCPAutoCorking
This counter is incremented every time we detected skb was under used
and its flush was deferred.

Tested:

Interesting effects when using line buffered commands under ssh.

Excellent performance results in term of cpu usage and total throughput.

lpq83:~# echo 1 >/proc/sys/net/ipv4/tcp_autocorking
lpq83:~# perf stat ./super_netperf 4 -t TCP_STREAM -H lpq84 -- -m 128
9410.39

 Performance counter stats for './super_netperf 4 -t TCP_STREAM -H lpq84 -- -m 128':

      35209.439626 task-clock                #    2.901 CPUs utilized
             2,294 context-switches          #    0.065 K/sec
               101 CPU-migrations            #    0.003 K/sec
             4,079 page-faults               #    0.116 K/sec
    97,923,241,298 cycles                    #    2.781 GHz                     [83.31%]
    51,832,908,236 stalled-cycles-frontend   #   52.93% frontend cycles idle    [83.30%]
    25,697,986,603 stalled-cycles-backend    #   26.24% backend  cycles idle    [66.70%]
   102,225,978,536 instructions              #    1.04  insns per cycle
                                             #    0.51  stalled cycles per insn [83.38%]
    18,657,696,819 branches                  #  529.906 M/sec                   [83.29%]
        91,679,646 branch-misses             #    0.49% of all branches         [83.40%]

      12.136204899 seconds time elapsed

lpq83:~# echo 0 >/proc/sys/net/ipv4/tcp_autocorking
lpq83:~# perf stat ./super_netperf 4 -t TCP_STREAM -H lpq84 -- -m 128
6624.89

 Performance counter stats for './super_netperf 4 -t TCP_STREAM -H lpq84 -- -m 128':
      40045.864494 task-clock                #    3.301 CPUs utilized
               171 context-switches          #    0.004 K/sec
                53 CPU-migrations            #    0.001 K/sec
             4,080 page-faults               #    0.102 K/sec
   111,340,458,645 cycles                    #    2.780 GHz                     [83.34%]
    61,778,039,277 stalled-cycles-frontend   #   55.49% frontend cycles idle    [83.31%]
    29,295,522,759 stalled-cycles-backend    #   26.31% backend  cycles idle    [66.67%]
   108,654,349,355 instructions              #    0.98  insns per cycle
                                             #    0.57  stalled cycles per insn [83.34%]
    19,552,170,748 branches                  #  488.244 M/sec                   [83.34%]
       157,875,417 branch-misses             #    0.81% of all branches         [83.34%]

      12.130267788 seconds time elapsed

Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2013-12-06 12:51:41 -05:00
Eric Dumazet
7b7fc97aa3 tcp: optimize some skb_shinfo(skb) uses
Compiler doesn't know skb_shinfo(skb) pointer is usually constant.

By using a temporary variable, we help generating smaller code.

For example, tcp_init_nondata_skb() is inlined after this patch.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2013-12-06 12:51:40 -05:00