commit be9f4a44e7 (ipv4: tcp: remove per net tcp_sock) added a
selinux regression, reported and bisected by John Stultz
selinux_ip_postroute_compat() expect to find a valid sk->sk_security
pointer, but this field is NULL for unicast_sock
It turns out that unicast_sock are really temporary stuff to be able
to reuse part of IP stack (ip_append_data()/ip_push_pending_frames())
Fact is that frames sent by ip_send_unicast_reply() should be orphaned
to not fool LSM.
Note IPv6 never had this problem, as tcp_v6_send_response() doesnt use a
fake socket at all. I'll probably implement tcp_v4_send_response() to
remove these unicast_sock in linux-3.7
Reported-by: John Stultz <johnstul@us.ibm.com>
Bisected-by: John Stultz <johnstul@us.ibm.com>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Paul Moore <paul@paul-moore.com>
Cc: Eric Paris <eparis@parisplace.org>
Cc: "Serge E. Hallyn" <serge@hallyn.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Disabling PMTU discovery can increase the output packet
rate but some users have enough resources and prefer to fragment
than to drop traffic. By default, we copy the DF bit but if
pmtu_disc is disabled we do not send FRAG_NEEDED messages anymore.
Signed-off-by: Julian Anastasov <ja@ssi.bg>
Signed-off-by: Simon Horman <horms@verge.net.au>
IPVS is missing the logic to update PMTU in routing
for its IPIP packets. We monitor the dst_mtu and can return
FRAG_NEEDED messages but if the tunneled packets get ICMP
error we can not rely on other traffic to save the lowest
MTU.
The following patch adds ICMP handling for IPIP
packets in incoming direction, from some remote host to
our local IP used as saddr in the outer header. By this
way we can forward any related ICMP traffic if it is for IPVS
TUN connection. For the special case of PMTUD we update the
routing and if client requested DF we can forward the
error.
To properly update the routing we have to bind
the cached route (dest->dst_cache) to the selected saddr
because ipv4_update_pmtu uses saddr for dst lookup.
Add IP_VS_RT_MODE_CONNECT flag to force such binding with
second route.
Update ip_vs_tunnel_xmit to provide IP_VS_RT_MODE_CONNECT
and change the code to copy DF. For now we prefer not to
force PMTU discovery (outer DF=1) because we don't have
configuration option to enable or disable PMTUD. As we
do not keep any packets to resend, we prefer not to
play games with packets without DF bit because the sender
is not informed when they are rejected.
Also, change ops->update_pmtu to be called only
for local clients because there is no point to update
MTU for input routes, in our case skb->dst->dev is lo.
It seems the code is copied from ipip.c where the skb
dst points to tunnel device.
Signed-off-by: Julian Anastasov <ja@ssi.bg>
Signed-off-by: Simon Horman <horms@verge.net.au>
Removed the following sparse warnings, wether CONFIG_SYSCTL
is defined or not:
* warning: symbol 'ip_vs_control_net_init_sysctl' was not
declared. Should it be static?
* warning: symbol 'ip_vs_control_net_cleanup_sysctl' was
not declared. Should it be static?
Signed-off-by: Claudiu Ghioc <claudiu.ghioc@gmail.com>
Signed-off-by: Simon Horman <horms@verge.net.au>
Get rid of the ftp_app pointer and allow applications
to be registered without adding fields in the netns_ipvs structure.
v2: fix coding style as suggested by Pablo Neira Ayuso <pablo@netfilter.org>
Signed-off-by: Julian Anastasov <ja@ssi.bg>
Signed-off-by: Simon Horman <horms@verge.net.au>
The FTP application indirectly depends on the
nf_conntrack_ftp helper for proper NAT support. If the
module is not loaded, IPVS can resize the packets for the
command connection, eg. PASV response but the SEQ adjustment
logic in ipv4_confirm is not called without helper.
Signed-off-by: Julian Anastasov <ja@ssi.bg>
Signed-off-by: Simon Horman <horms@verge.net.au>
As pointed out, there are places, that access net->loopback_dev->ifindex
and after ifindex generation is made per-net this value becomes constant
equals 1. So go ahead and introduce the LOOPBACK_IFINDEX constant and use
it where appropriate.
Signed-off-by: Pavel Emelyanov <xemul@parallels.com>
Acked-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Strictly speaking this is only _really_ required for checkpoint-restore to
make loopback device always have the same index.
This change appears to be safe wrt "ifindex should be unique per-system"
concept, as all the ifindex usage is either already made per net namespace
of is explicitly limited with init_net only.
There are two cool side effects of this. The first one -- ifindices of
devices in container are always small, regardless of how many containers
we've started (and re-started) so far. The second one is -- we can speed
up the loopback ifidex access as shown in the next patch.
v2: Place ifindex right after dev_base_seq : avoid two holes and use the
same cache line, dirtied in list_netdevice()/unlist_netdevice()
Signed-off-by: Pavel Emelyanov <xemul@parallels.com>
Acked-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Currently the RTM_NEWLINK results in -EOPNOTSUPP if the ifinfomsg->ifi_index
is not zero. I propose to allow requesting ifindices on link creation. This
is required by the checkpoint-restore to correctly restore a net namespace
(i.e. -- a container).
Signed-off-by: Pavel Emelyanov <xemul@parallels.com>
Acked-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Various /proc/net files sometimes report crazy timer values, expressed
in clock_t units.
This happens when an expired timer delta (expires - jiffies) is passed
to jiffies_to_clock_t().
This function has an overflow in :
return div_u64((u64)x * TICK_NSEC, NSEC_PER_SEC / USER_HZ);
commit cbbc719fcc (time: Change jiffies_to_clock_t() argument type
to unsigned long) only got around the problem.
As we cant output negative values in /proc/net/tcp without breaking
various tools, I suggest adding a jiffies_delta_to_clock_t() wrapper
that caps the negative delta to a 0 value.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Reported-by: Maciej Żenczykowski <maze@google.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Paul Gortmaker <paul.gortmaker@windriver.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: hank <pyu@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
We currently leak all tcp metrics at struct net dismantle time.
tcp_net_metrics_exit() frees the hash table, we must first
iterate it to free all metrics.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Do not leak memory by updating pointer with potentially NULL realloc return value.
Found by Linux Driver Verification project (linuxtesting.org).
Signed-off-by: Alexey Khoroshilov <khoroshilov@ispras.ru>
Signed-off-by: David S. Miller <davem@davemloft.net>
Memory is allocated for 'tt_change_node' with kmalloc().
'tt_change_node' may go out of scope really being used for anything
(except have a few members initialized) if we hit the 'del:' label.
This patch makes sure we free the memory in that case.
Signed-off-by: Jesper Juhl <jj@chaosbits.net>
Acked-by: Antonio Quartulli <ordex@autistici.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
[Resending again, as the text was corrupted by the email client]
To speed up operations, QFQ internally divides classes into
groups. Which group a class belongs to depends on the ratio between
the maximum packet length and the weight of the class. Unfortunately
the function qfq_change_class lacks the steps for changing the group
of a class when the ratio max_pkt_len/weight of the class changes.
For example, when the last of the following three commands is
executed, the group of class 1:1 is not correctly changed:
tc disc add dev XXX root handle 1: qfq
tc class add dev XXX parent 1: qfq classid 1:1 weight 1
tc class change dev XXX parent 1: classid 1:1 qfq weight 4
Not changing the group of a class does not affect the long-term
bandwidth guaranteed to the class, as the latter is independent of the
maximum packet length, and correctly changes (only) if the weight of
the class changes. In contrast, if the group of the class is not
updated, the class is still guaranteed the short-term bandwidth and
packet delay related to its old group, instead of the guarantees that
it should receive according to its new weight and/or maximum packet
length. This may also break service guarantees for other classes.
This patch adds the missing operations.
Signed-off-by: Paolo Valente <paolo.valente@unimore.it>
Signed-off-by: David S. Miller <davem@davemloft.net>
While investigating on network performance problems, I found this little
gem :
$ nm -v vmlinux | grep -1 dst_default_metrics
ffffffff82736540 b busy.46605
ffffffff82736560 B dst_default_metrics
ffffffff82736598 b dst_busy_list
Apparently, declaring a const array without initializer put it in
(writeable) bss section, in middle of possibly often dirtied cache
lines.
Since we really want dst_default_metrics be const to avoid any possible
false sharing and catch any buggy writes, I force a null initializer.
ffffffff818a4c20 R dst_default_metrics
Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Ben Hutchings <bhutchings@solarflare.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
After IP route cache removal, I believe rcu_bh() has very little use and
we should remove this RCU variant, since it adds some cycles in fast
path.
Anyway, the call_rcu_bh() use in fib_true is obviously wrong, since
some users only assert rcu_read_lock().
Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: "Paul E. McKenney" <paulmck@linux.vnet.ibm.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Quiets the sparse warning:
warning: Using plain integer as NULL pointer
Signed-off-by: Ying Xue <ying.xue@windriver.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Pull networking fixes from David Miller:
1) Missed rcu_assign_pointer() in mac80211 scanning, from Johannes
Berg.
2) Allow devices to limit the number of segments that an individual
TCP TSO packet can use at a time, to deal with device and/or driver
specific limitations. From Ben Hutchings.
3) Fix unexpected hard IPSEC expiration after setting the date. From
Fan Du.
4) Memory leak fix in bxn2x driver, from Jesper Juhl.
5) Fix two memory leaks in libertas driver, from Daniel Drake.
6) Fix deref of out-of-range array index in packet scheduler generic
actions layer. From Hiroaki SHIMODA.
7) Fix TX flow control errors in mlx4 driver, from Yevgeny Petrilin.
8) Fix CRIS eth_v10.c driver build, from Randy Dunlap.
9) Fix wrong SKB freeing in LLC protocol layer, from Sorin Dumitru.
10) The IP output path checks neigh lookup errors incorrectly, it needs
to use IS_ERR(). From Vasiliy Kulikov.
11) An estimator leak leads to deref of freed memory in timer handler,
fix from Hiroaki SHIMODA.
12) TCP early demux in ipv6 needs to use DST cookies in order to
validate the RX route properly. Fix from Eric Dumazet.
* git://git.kernel.org/pub/scm/linux/kernel/git/davem/net: (43 commits)
net: ipv6: fix TCP early demux
net: Use PTR_RET rather than if(IS_ERR(.. [1]
net_sched: act: Delete estimator in error path.
ip: fix error handling in ip_finish_output2()
llc: free the right skb
ixp4xx_eth: fix ptp_ixp46x build failure
drivers/atm/iphase.c: fix error return code
tcp_output: fix sparse warning for tcp_wfree
drivers/net/phy/mdio-mux-gpio.c: drop devm_kfree of devm_kzalloc'd data
batman-adv: select an internet gateway if none was chosen
mISDN: Bugfix for layer2 fixed TEI mode
igb: don't break user visible strings over multiple lines in igb_ethtool.c
igb: correct hardware type (i210/i211) check in igb_loopback_test()
igb: Fix for failure to init on some 82576 devices.
cris: fix eth_v10.c build error
cdc-ncm: tag Ericsson WWAN devices (eg F5521gw) with FLAG_WWAN
isdnloop: fix and simplify isdnloop_init()
hyperv: Move wait completion msg code into rndis_filter_halt_device()
net/mlx4_core: Remove port type restrictions
net/mlx4_en: Fixing TX queue stop/wake flow
...
__fls(x) is a bit faster than fls(x), granted we know x is non null.
As Ben Hutchings pointed out, fls(x) = __fls(x) + 1
Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Ben Hutchings <bhutchings@solarflare.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
When installing a flow with an action to set a particular field we
need to validate that the packets that are part of the flow actually
contain that header. With IP we use zeroed addresses and with TCP/UDP
the check is for zeroed ports. This check is overly broad and can catch
packets like DHCP requests that have a zero source address in a
legitimate header. This changes the check to look for a zeroed protocol
number for IP or for both ports be zero for TCP/UDP before considering
the header to not exist.
Reported-by: Ethan Jackson <ethan@nicira.com>
Signed-off-by: Jesse Gross <jesse@nicira.com>
While playing with CoDel and ECN marking, I discovered a
non optimal behavior of receiver of CE (Congestion Encountered)
segments.
In pathological cases, sender has reduced its cwnd to low values,
and receiver delays its ACK (by 40 ms).
While RFC 3168 6.1.3 (The TCP Receiver) doesn't explicitly recommend
to send immediate ACKS, we believe its better to not delay ACKS, because
a CE segment should give same signal than a dropped segment, and its
quite important to reduce RTT to give ECE/CWR signals as fast as
possible.
Note we already call tcp_enter_quickack_mode() from TCP_ECN_check_ce()
if we receive a retransmit, for the same reason.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Neal Cardwell <ncardwell@google.com>
Acked-by: Neal Cardwell <ncardwell@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
While doing TCP ECN tests, I discovered GRO was reordering packets if it
receives one packet with CE set, while previous packets in same NAPI run
have ECT(0) for the same flow :
09:25:25.857620 IP (tos 0x2,ECT(0), ttl 64, id 27893, offset 0, flags
[DF], proto TCP (6), length 4396)
172.30.42.19.54550 > 172.30.42.13.44139: Flags [.], seq
233801:238145, ack 1, win 115, options [nop,nop,TS val 3397779 ecr
1990627], length 4344
09:25:25.857626 IP (tos 0x3,CE, ttl 64, id 27892, offset 0, flags [DF],
proto TCP (6), length 1500)
172.30.42.19.54550 > 172.30.42.13.44139: Flags [.], seq
232353:233801, ack 1, win 115, options [nop,nop,TS val 3397779 ecr
1990627], length 1448
09:25:25.857638 IP (tos 0x0, ttl 64, id 34581, offset 0, flags [DF],
proto TCP (6), length 64)
172.30.42.13.44139 > 172.30.42.19.54550: Flags [.], cksum 0xac8f
(incorrect -> 0xca69), ack 232353, win 1271, options [nop,nop,TS val
1990627 ecr 3397779,nop,nop,sack 1 {233801:238145}], length 0
We have two problems here :
1) GRO reorders packets
If NIC gave packet1, then packet2, which happen to be from "different
flows" GRO feeds stack with packet2, then packet1. I have yet to
understand how to solve this problem.
2) GRO is not ECN friendly
Delivering packets out of order makes TCP stack not as fast as it could
be.
In this patch I suggest we make the tos test not part of the 'same_flow'
determination, but part of the 'should flush' logic
Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Herbert Xu <herbert@gondor.apana.org.au>
Acked-by: Herbert Xu <herbert@gondor.apana.org.au>
Signed-off-by: David S. Miller <davem@davemloft.net>
IPv6 needs a cookie in dst_check() call.
We need to add rx_dst_cookie and provide a family independent
sk_rx_dst_set(sk, skb) method to properly support IPv6 TCP early demux.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Some action modules free struct tcf_common in their error path
while estimator is still active. This results in est_timer()
dereference freed memory.
Add gen_kill_estimator() in ipt, pedit and simple action.
Signed-off-by: Hiroaki SHIMODA <shimoda.hiroaki@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
__neigh_create() returns either a pointer to struct neighbour or PTR_ERR().
But the caller expects it to return either a pointer or NULL. Replace
the NULL check with IS_ERR() check.
The bug was introduced in a263b30936
("ipv4: Make neigh lookups directly in output packet path.").
Signed-off-by: Vasily Kulikov <segoon@openwall.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
We are freeing skb instead of nskb, resulting in a double
free on skb and a leak from nskb.
Signed-off-by: Sorin Dumitru <sdumitru@ixiacom.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Fix sparse warning:
* symbol 'tcp_wfree' was not declared. Should it be static?
Signed-off-by: Silviu-Mihai Popescu <silviupopescu1990@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
This is a regression introduced by: 2265c14108
("batman-adv: gateway election code refactoring")
Reported-by: Nicolás Echániz <nicoechaniz@codigosur.org>
Signed-off-by: Marek Lindner <lindner_marek@yahoo.de>
Acked-by: Antonio Quartulli <ordex@autistici.org>
Signed-off-by: Antonio Quartulli <ordex@autistici.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
libertas currently calls cfg80211_disconnected() when it is being
brought down. This causes an event to be allocated, but since the
wdev is already removed from the rdev by the time that the event
processing work executes, the event is never processed or freed.
http://article.gmane.org/gmane.linux.kernel.wireless.general/95666
Fix this leak, and other possible situations, by processing the event
queue when a device is being unregistered. Thanks to Johannes Berg for
the suggestion.
Signed-off-by: Daniel Drake <dsd@laptop.org>
Cc: stable@vger.kernel.org
Reviewed-by: Johannes Berg <johannes@sipsolutions.net>
Signed-off-by: John W. Linville <linville@tuxdriver.com>
If l2cap_chan_create() fails then it will return from l2cap_sock_kill
since zapped flag of sk is reset.
Signed-off-by: Jaganath Kanakkassery <jaganath.k@samsung.com>
Signed-off-by: Gustavo Padovan <gustavo.padovan@collabora.co.uk>
smp_chan_create might return NULL so we need to check before
dereferencing smp.
Signed-off-by: Andrei Emeltchenko <andrei.emeltchenko@intel.com>
Signed-off-by: Gustavo Padovan <gustavo.padovan@collabora.co.uk>
When the name of the given entry is empty , the state needs to be
updated accordingly.
Cc: stable@vger.kernel.org
Signed-off-by: Ram Malovany <ramm@ti.com>
Signed-off-by: Gustavo Padovan <gustavo.padovan@collabora.co.uk>
If the device was not found in a list of found devices names of which
are pending.This may happen in a case when HCI Remote Name Request
was sent as a part of incoming connection establishment procedure.
Hence there is no need to continue resolving a next name as it will
be done upon receiving another Remote Name Request Complete Event.
This will fix a kernel crash when trying to use this entry to resolve
the next name.
Cc: stable@vger.kernel.org
Signed-off-by: Ram Malovany <ramm@ti.com>
Signed-off-by: Gustavo Padovan <gustavo.padovan@collabora.co.uk>
If entry wasn't found in the hci_inquiry_cache_lookup_resolve do not
resolve the name.This will fix a kernel crash when trying to use NULL
pointer.
Cc: stable@vger.kernel.org
Signed-off-by: Ram Malovany <ramm@ti.com>
Signed-off-by: Gustavo Padovan <gustavo.padovan@collabora.co.uk>
This patch moves the hci_conn check to begining of hci_le_conn_
complete_evt in order to improve code's readability and better
error handling.
Signed-off-by: Andre Guedes <andre.guedes@openbossa.org>
Signed-off-by: Gustavo Padovan <gustavo.padovan@collabora.co.uk>
This patch does a trivial code refactoring in hci_conn lookup in
hci_le_conn_complete_evt. It performs the hci_conn lookup at the
begining of the function since it is used by both flows (error
and success).
Signed-off-by: Andre Guedes <andre.guedes@openbossa.org>
Signed-off-by: Gustavo Padovan <gustavo.padovan@collabora.co.uk>
This patch changes hci_cs_le_create_conn to perform hci_conn lookup
by state instead of bdaddr.
Since we can have only one LE connection in BT_CONNECT state, we can
perform LE hci_conn lookup by state. This way, we don't rely on
hci_sent_cmd_data helper to find the hci_conn object.
Signed-off-by: Andre Guedes <andre.guedes@openbossa.org>
Signed-off-by: Gustavo Padovan <gustavo.padovan@collabora.co.uk>
This patch does some code refactoring in hci_cs_le_create_conn
function. The hci_conn object is only needed in case of failure,
therefore hdev locking and hci_conn lookup were moved to
if-statement scope.
Also, the conn->state check was removed since we should always
close the connection if it fails.
Signed-off-by: Andre Guedes <andre.guedes@openbossa.org>
Signed-off-by: Gustavo Padovan <gustavo.padovan@collabora.co.uk>
This patch removes some unneeded code from hci_cs_le_create_conn.
If the hci_conn is not found, it means this LE connection attempt
was triggered by a thrid-party tool (e.g. hcitool). We should not
create this new hci_conn in LE Create Connection command status
event since it is already properly handled in LE Connection
Complete event.
Signed-off-by: Andre Guedes <andre.guedes@openbossa.org>
Signed-off-by: Gustavo Padovan <gustavo.padovan@collabora.co.uk>
We need to check the 'Role' parameter from the LE Connection
Complete Event in order to properly set 'out' and 'link_mode'
fields from hci_conn structure.
Signed-off-by: Andre Guedes <andre.guedes@openbossa.org>
Signed-off-by: Gustavo Padovan <gustavo.padovan@collabora.co.uk>
This patch replaces the unlock-and-return statements by the goto
statement.
Signed-off-by: Andre Guedes <andre.guedes@openbossa.org>
Signed-off-by: Gustavo Padovan <gustavo.padovan@collabora.co.uk>