As per RFC4821 7.3. Selecting Probe Size, a probe timer should
be armed once probing has converged. Once this timer expired,
probing again to take advantage of any path PMTU change. The
recommended probing interval is 10 minutes per RFC1981. Probing
interval could be sysctled by sysctl_tcp_probe_interval.
Eric Dumazet suggested to implement pseudo timer based on 32bits
jiffies tcp_time_stamp instead of using classic timer for such
rare event.
Signed-off-by: Fan Du <fan.du@intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Current probe_size is chosen by doubling mss_cache,
the probing process will end shortly with a sub-optimal
mss size, and the link mtu will not be taken full
advantage of, in return, this will make user to tweak
tcp_base_mss with care.
Use binary search to choose probe_size in a fine
granularity manner, an optimal mss will be found
to boost performance as its maxmium.
In addition, introduce a sysctl_tcp_probe_threshold
to control when probing will stop in respect to
the width of search range.
Test env:
Docker instance with vxlan encapuslation(82599EB)
iperf -c 10.0.0.24 -t 60
before this patch:
1.26 Gbits/sec
After this patch: increase 26%
1.59 Gbits/sec
Signed-off-by: Fan Du <fan.du@intel.com>
Acked-by: John Heffner <johnwheffner@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Other users users of the neighbour table use neigh->output as the method
to decided when and which link-layer header to place on a packet.
DECnet has been using neigh->output to decide which DECnet headers to
place on a packet depending which neighbour the packet is destined for.
The DECnet usage isn't totally wrong but it can run into problems if the
neighbour output function is run for a second time as the teql driver
and the bridge netfilter code can do.
Therefore to avoid pathologic problems later down the line and make the
neighbour code easier to understand by refactoring the decnet output
code to only use a neighbour method to add a link layer header to a
packet.
This is done by moving the neigbhour operations lookup from
dn_to_neigh_output to dn_neigh_output_packet.
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Highlights include:
- Fix a regression in the NFSv4 open state recovery code
- Fix a regression in the NFSv4 close code
- Fix regressions and side-effects of the loop-back mounted NFS fixes
in 3.18, that cause the NFS read() syscall to return EBUSY.
- Fix regressions around the readdirplus code and how it interacts with
the VFS lazy unmount changes that went into v3.18.
- Fix issues with out-of-order RPC call replies replacing updated
attributes with stale ones (particularly after a truncate()).
- Fix an underflow checking issue with RPC/RDMA credits
- Fix a number of issues with the NFSv4 delegation return/free code.
- Fix issues around stale NFSv4.1 leases when doing a mount
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1
iQIcBAABAgAGBQJU+SJ6AAoJEGcL54qWCgDyVgQQAKNsF/O2O9ip2uAHZRZvM6TS
Ev8c3Spj/FmRI1tlCcGi1zZ8uCSwvPQz3uAN0vOTMocUWjokT5sAgN5yBIxHasem
6YK7jxs9WHiP7MGyReaAFJwG/W6LZndnlNqPWPs9KiaWwKVXIsQ3uFm/Y0lr90Fi
ew16DQm0DUd4Yvv42WJR9ay7UwUPvT7wmaGIVVK2hjQqr2lx02jspt5kfrC+vsMU
OYU/0YDofb2ajs5krbah6tUHf3VDnSVrXP6if66IrukCM9S4AvowpnMJQ5QJALh+
cPlqHDm2ZzuIecpqZEgYLM73wQ2q+KBXTlDLcgYg6LjnqBEivwO9RDn6tfCwKTcS
tCFohQc9iDOj9rTZ9EQlQME6u/FdxWncpxyTsMyBk7FlcLsOQRqio/FXZhbyJGuH
gvIdIW3fPseijtejpYkxgabe6JL9NRvjv3SnOay7xHs9Vn4tRkFF2mkkQDZOG6HT
atkxQp8kB3m9gMoeAmTLdTcJkcFdk6AKnNrcyJaa1GW4msmMuGZtq/6vayKJsBZb
OIw788bDSOVpcVR+6SAC24/dutcl+WJHlSJShqvIrTKejBxPCc7IVSeg63FJsWTO
sxfXdUr3wVJ1ooDFCspeBj+3zAimIq7qDmRRs85ekgEtxrgdUldhg/VwbDjvLjmb
whXFpiCS7Ii0fVYtZvjz
=qMB7
-----END PGP SIGNATURE-----
Merge tag 'nfs-for-4.0-3' of git://git.linux-nfs.org/projects/trondmy/linux-nfs
Pull NFS client bugfixes from Trond Myklebust:
"Highlights include:
- Fix a regression in the NFSv4 open state recovery code
- Fix a regression in the NFSv4 close code
- Fix regressions and side-effects of the loop-back mounted NFS fixes
in 3.18, that cause the NFS read() syscall to return EBUSY.
- Fix regressions around the readdirplus code and how it interacts
with the VFS lazy unmount changes that went into v3.18.
- Fix issues with out-of-order RPC call replies replacing updated
attributes with stale ones (particularly after a truncate()).
- Fix an underflow checking issue with RPC/RDMA credits
- Fix a number of issues with the NFSv4 delegation return/free code.
- Fix issues around stale NFSv4.1 leases when doing a mount"
* tag 'nfs-for-4.0-3' of git://git.linux-nfs.org/projects/trondmy/linux-nfs: (24 commits)
NFSv4.1: Clear the old state by our client id before establishing a new lease
NFSv4: Fix a race in NFSv4.1 server trunking discovery
NFS: Don't write enable new pages while an invalidation is proceeding
NFS: Fix a regression in the read() syscall
NFSv4: Ensure we skip delegations that are already being returned
NFSv4: Pin the superblock while we're returning the delegation
NFSv4: Ensure we honour NFS_DELEGATION_RETURNING in nfs_inode_set_delegation()
NFSv4: Ensure that we don't reap a delegation that is being returned
NFS: Fix stateid used for NFS v4 closes
NFSv4: Don't call put_rpccred() under the rcu_read_lock()
NFS: Don't require a filehandle to refresh the inode in nfs_prime_dcache()
NFSv3: Use the readdir fileid as the mounted-on-fileid
NFS: Don't invalidate a submounted dentry in nfs_prime_dcache()
NFSv4: Set a barrier in the update_changeattr() helper
NFS: Fix nfs_post_op_update_inode() to set an attribute barrier
NFS: Remove size hack in nfs_inode_attrs_need_update()
NFSv4: Add attribute update barriers to delegreturn and pNFS layoutcommit
NFS: Add attribute update barriers to NFS writebacks
NFS: Set an attribute barrier on all updates
NFS: Add attribute update barriers to nfs_setattr_update_inode()
...
net/ipv4/fib_trie.c: In function ‘fib_table_flush_external’:
net/ipv4/fib_trie.c:1572:6: warning: unused variable ‘found’ [-Wunused-variable]
int found = 0;
^
net/ipv4/fib_trie.c:1571:16: warning: unused variable ‘slen’ [-Wunused-variable]
unsigned char slen;
^
Signed-off-by: David S. Miller <davem@davemloft.net>
Call into the switchdev driver any time an IPv4 fib entry is
added/modified/deleted from the kernel's FIB. The switchdev driver may or
may not install the route to the offload device. In the case where the
driver tries to install the route and something goes wrong (device's routing
table is full, etc), then all of the offloaded routes will be flushed from the
device, route forwarding falls back to the kernel, and no more routes are
offloading.
We can refine this logic later. For now, use the simplist model of offloading
routes up to the point of failure, and then on failure, undo everything and
mark IPv4 offloading disabled.
Signed-off-by: Scott Feldman <sfeldma@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Flesh out ndo wrappers to call into device driver. To call into device driver,
the wrapper must interate over route's nexthops to ensure all nexthop devs
belong to the same switch device. Currently, there is no support for route's
nexthops spanning offloaded and non-offloaded devices, or spanning ports of
multiple offload devices.
Since switch device ports may be stacked under virtual interfaces (bonds and/or
bridges), and the route's nexthop may be on the virtual interface, the wrapper
will traverse the nexthop dev down to the base dev. It's the base dev that's
passed to the switchdev driver's ndo ops.
Signed-off-by: Scott Feldman <sfeldma@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Keep switchdev FIB offload model simple for now and don't allow custom ip
rules.
Signed-off-by: Scott Feldman <sfeldma@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Add IPv4 fib ndo wrapper funcs and stub them out for now.
Signed-off-by: Scott Feldman <sfeldma@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Extract the core logic that setups a 'struct dsa_switch_tree' and
removes it, update dsa_probe() and dsa_remove() to use the two helper
functions. This will be useful to allow for other callers to setup
this structure differently.
Signed-off-by: Florian Fainelli <f.fainelli@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
In order to support the new DSA device driver model, a dsa_switch should
be able to advertise the type of tagging protocol supported by the
underlying switch device. This also removes constraints on how tagging
can be stacked to each other.
Signed-off-by: Florian Fainelli <f.fainelli@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Split the part of dsa_switch_setup() which is responsible for allocating
and initializing a 'struct dsa_switch' and the part which is doing a
given switch device setup and slave network device creation.
This is a preliminary change to allow a separate caller of
dsa_switch_setup_one() which may have externally initialized the
dsa_switch structure, outside of dsa_switch_setup().
Signed-off-by: Florian Fainelli <f.fainelli@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
In preparation for allowing a different model to register DSA switches,
update dsa_of_probe() and dsa_probe() to return -EPROBE_DEFER where
appropriate.
Failure to find a phandle or Device Tree property is still fatal, but
looking up the internal device structure associated with a Device Tree
node is something that might need to be delayed based on driver probe
ordering.
Signed-off-by: Florian Fainelli <f.fainelli@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
In preparation for allowing a different mechanism to register DSA switch
devices and driver, update dsa_of_probe and dsa_of_remove to take a
struct device pointer since neither of these two functions uses the
struct platform_device pointer.
Signed-off-by: Florian Fainelli <f.fainelli@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
timewait sockets now share a common base with established sockets.
inet_twsk_diag_dump() can use inet_diag_bc_sk() instead of duplicating
code, granted that inet_diag_bc_sk() does proper userlocks
initialization.
twsk_build_assert() will catch any future changes that could break
the assumptions.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
With some mss values, it is possible tcp_xmit_size_goal() puts
one segment more in TSO packet than tcp_tso_autosize().
We send then one TSO packet followed by one single MSS.
It is not a serious bug, but we can do slightly better, especially
for drivers using netif_set_gso_max_size() to lower gso_max_size.
Using same formula avoids these corner cases and makes
tcp_xmit_size_goal() a bit faster.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Fixes: 605ad7f184 ("tcp: refine TSO autosizing")
Acked-by: Neal Cardwell <ncardwell@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
The ip/udp bearer can be configured in a point-to-point
mode by specifying both local and remote ip/hostname,
or it can be enabled in multicast mode, where links are
established to all tipc nodes that have joined the same
multicast group. The multicast IP address is generated
based on the TIPC network ID, but can be overridden by
using another multicast address as remote ip.
Signed-off-by: Erik Hugne <erik.hugne@ericsson.com>
Reviewed-by: Jon Maloy <jon.maloy@ericsson.com>
Reviewed-by: Ying Xue <ying.xue@windriver.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
The payload area following the TIPC discovery message header is an
opaque area defined by the media. INT_H_SIZE was enough for
Ethernet/IB/IPv4 but needs to be expanded to carry IPv6 addressing
information.
Signed-off-by: Erik Hugne <erik.hugne@ericsson.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Pablo Neira Ayuso says:
====================
Netfilter/IPVS fixes for net
The following patchset contains Netfilter/IPVS fixes for your net tree,
they are:
1) Don't truncate ethernet protocol type to u8 in nft_compat, from
Arturo Borrero.
2) Fix several problems in the addition/deletion of elements in nf_tables.
3) Fix module refcount leak in ip_vs_sync, from Julian Anastasov.
4) Fix a race condition in the abort path in the nf_tables transaction
infrastructure. Basically aborted rules can show up as active rules
until changes are unrolled, oneliner from Patrick McHardy.
5) Check for overflows in the data area of the rule, also from Patrick.
6) Fix off-by-one in the per-rule user data size field. This introduces
a new nft_userdata structure that is placed at the beginning of the
user data area that contains the length to save some bits from the
rule and we only need one bit to indicate its presence, from Patrick.
7) Fix rule replacement error path, the replaced rule is deleted on
error instead of leaving it in place. This has been fixed by relying
on the abort path to undo the incomplete replacement.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
ip_check_defrag() may be used by af_packet to defragment outgoing packets.
skb_network_offset() of af_packet's outgoing packets is not zero.
Signed-off-by: Alexander Drozdov <al.drozdov@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Cc: Jamal Hadi Salim <jhs@mojatatu.com>
Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com>
Acked-by: Jamal Hadi Salim <jhs@mojatatu.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Cc: Jamal Hadi Salim <jhs@mojatatu.com>
Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com>
Acked-by: Jamal Hadi Salim <jhs@mojatatu.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Fixes this build error:
net/mpls/af_mpls.c: In function 'resize_platform_label_table':
net/mpls/af_mpls.c:767:4: error: implicit declaration of function 'vzalloc' [-Werror=implicit-function-declaration]
labels = vzalloc(size);
^
Fixes: 7720c01f3f ("mpls: Add a sysctl to control the size of the mpls label table")
Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au>
Signed-off-by: David S. Miller <davem@davemloft.net>
* JUMP and GOTO are equivalent except for JUMP pushing the current
context to the stack
* RETURN and implicit RETURN (CONTINUE) are equivalent except that
the logged rule number differs
Result:
nft_do_chain | -112
1 function changed, 112 bytes removed, diff: -112
Signed-off-by: Patrick McHardy <kaber@trash.net>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
The tracing code is squeezed between multiple related parts of the
evaluation code, move it out. Also add an inline wrapper for the
reoccuring test for skb->nf_trace.
Small code savings in nft_do_chain():
nft_trace_packet | -137
nft_do_chain | -8
2 functions changed, 145 bytes removed, diff: -145
net/netfilter/nf_tables_core.c:
__nft_trace_packet | +137
1 function changed, 137 bytes added, diff: +137
net/netfilter/nf_tables_core.o:
3 functions changed, 137 bytes added, 145 bytes removed, diff: -8
Signed-off-by: Patrick McHardy <kaber@trash.net>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
xt_cluster supersedes ipt_CLUSTERIP since it can be also used in
gateway configurations (not only from the backend side).
ipt_CLUSTER is also known to leak the netdev that it uses on
device removal, which requires a rather large fix to workaround
the problem: http://patchwork.ozlabs.org/patch/358629/
So let's deprecate this so we can probably kill code this in the
future.
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
This extends the design in commit 958501163d ("bridge: Add support for
IEEE 802.11 Proxy ARP") with optional set of rules that are needed to
meet the IEEE 802.11 and Hotspot 2.0 requirements for ProxyARP. The
previously added BR_PROXYARP behavior is left as-is and a new
BR_PROXYARP_WIFI alternative is added so that this behavior can be
configured from user space when required.
In addition, this enables proxyarp functionality for unicast ARP
requests for both BR_PROXYARP and BR_PROXYARP_WIFI since it is possible
to use unicast as well as broadcast for these frames.
The key differences in functionality:
BR_PROXYARP:
- uses the flag on the bridge port on which the request frame was
received to determine whether to reply
- block bridge port flooding completely on ports that enable proxy ARP
BR_PROXYARP_WIFI:
- uses the flag on the bridge port to which the target device of the
request belongs
- block bridge port flooding selectively based on whether the proxyarp
functionality replied
Signed-off-by: Jouni Malinen <jouni@codeaurora.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
This patch adds code to prevent us from attempting to allocate a tnode with
a size larger than what can be represented by size_t.
Signed-off-by: Alexander Duyck <alexander.h.duyck@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
This change updates the fib_table_lookup function so that it is in sync
with the fib_find_node function in terms of the explanation for the index
check based on the bits value.
I have also updated it from doing a mask to just doing a compare as I have
found that seems to provide more options to the compiler as I have seen it
turn this into a shift of the value and test under some circumstances.
In addition I addressed one minor issue in which we kept computing the key
^ n->key when checking the fib aliases. I pulled the xor out of the loop
in order to reduce the number of memory reads in the lookup. As a result
we should save a couple cycles since the xor is only done once much earlier
in the lookup.
Signed-off-by: Alexander Duyck <alexander.h.duyck@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
The fib_table was wrapped in several places with an
rcu_read_lock/rcu_read_unlock however after looking over the code I found
several spots where the tables were being accessed as just standard
pointers without any protections. This change fixes that so that all of
the proper protections are in place when accessing the table to take RCU
replacement or removal of the table into account.
Signed-off-by: Alexander Duyck <alexander.h.duyck@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
If we are going to compact the leaf and tnode we first need to make sure
the fields are all in the same place. In that regard I am moving the leaf
pointer which represents the fib_alias hash list to occupy what is
currently the first key_vector pointer.
Signed-off-by: Alexander Duyck <alexander.h.duyck@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
This change makes it so that the insert and delete functions make use of
the tnode pointer returned in the fib_find_node call. By doing this we
will not have to rely on the parent pointer in the leaf which will be going
away soon.
Signed-off-by: Alexander Duyck <alexander.h.duyck@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
This change makes it so that the parent pointer is returned by reference in
fib_find_node. By doing this I can use it to find the parent node when I
am performing an insertion and I don't have to look for it again in
fib_insert_node.
Signed-off-by: Alexander Duyck <alexander.h.duyck@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
This change makes it so that leaf_walk_rcu takes a tnode and a key instead
of the trie and a leaf.
The main idea behind this is to avoid using the leaf parent pointer as that
can have additional overhead in the future as I am trying to reduce the
size of a leaf down to 16 bytes on 64b systems and 12b on 32b systems.
Signed-off-by: Alexander Duyck <alexander.h.duyck@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
This change makes it so that we only call resize on the tnodes, instead of
from each of the leaves. By doing this we can significantly reduce the
amount of time spent resizing as we can update all of the leaves in the
tnode first before we make any determinations about resizing. As a result
we can simply free the tnode in the case that all of the leaves from a
given tnode are flushed instead of resizing with each leaf removed.
Signed-off-by: Alexander Duyck <alexander.h.duyck@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
1. For an IPv4 ping socket, ping_check_bind_addr does not check
the family of the socket address that's passed in. Instead,
make it behave like inet_bind, which enforces either that the
address family is AF_INET, or that the family is AF_UNSPEC and
the address is 0.0.0.0.
2. For an IPv6 ping socket, ping_check_bind_addr returns EINVAL
if the socket family is not AF_INET6. Return EAFNOSUPPORT
instead, for consistency with inet6_bind.
3. Make ping_v4_sendmsg and ping_v6_sendmsg return EAFNOSUPPORT
instead of EINVAL if an incorrect socket address structure is
passed in.
4. Make IPv6 ping sockets be IPv6-only. The code does not support
IPv4, and it cannot easily be made to support IPv4 because
the protocol numbers for ICMP and ICMPv6 are different. This
makes connect(::ffff:192.0.2.1) fail with EAFNOSUPPORT instead
of making the socket unusable.
Among other things, this fixes an oops that can be triggered by:
int s = socket(AF_INET, SOCK_DGRAM, IPPROTO_ICMP);
struct sockaddr_in6 sin6 = {
.sin6_family = AF_INET6,
.sin6_addr = in6addr_any,
};
bind(s, (struct sockaddr *) &sin6, sizeof(sin6));
Change-Id: If06ca86d9f1e4593c0d6df174caca3487c57a241
Signed-off-by: Lorenzo Colitti <lorenzo@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
In general, if a transaction object is added to the list successfully,
we can rely on the abort path to undo what we've done. This allows us to
simplify the error handling of the rule replacement path in
nf_tables_newrule().
This implicitly fixes an unnecessary removal of the old rule, which
needs to be left in place if we fail to replace.
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
The NFT_USERDATA_MAXLEN is defined to 256, however we only have a u8
to store its size. Introduce a struct nft_userdata which contains a
length field and indicate its presence using a single bit in the rule.
The length field of struct nft_userdata is also a u8, however we don't
store zero sized data, so the actual length is udata->len + 1.
Signed-off-by: Patrick McHardy <kaber@trash.net>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Check that the space required for the expressions doesn't exceed the
size of the dlen field, which would lead to the iterators crashing.
Signed-off-by: Patrick McHardy <kaber@trash.net>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
A race condition exists in the rule transaction code for rules that
get added and removed within the same transaction.
The new rule starts out as inactive in the current and active in the
next generation and is inserted into the ruleset. When it is deleted,
it is additionally set to inactive in the next generation as well.
On commit the next generation is begun, then the actions are finalized.
For the new rule this would mean clearing out the inactive bit for
the previously current, now next generation.
However nft_rule_clear() clears out the bits for *both* generations,
activating the rule in the current generation, where it should be
deactivated due to being deleted. The rule will thus be active until
the deletion is finalized, removing the rule from the ruleset.
Similarly, when aborting a transaction for the same case, the undo
of insertion will remove it from the RCU protected rule list, the
deletion will clear out all bits. However until the next RCU
synchronization after all operations have been undone, the rule is
active on CPUs which can still see the rule on the list.
Generally, there may never be any modifications of the current
generations' inactive bit since this defeats the entire purpose of
atomicity. Change nft_rule_clear() to only touch the next generations
bit to fix this.
Signed-off-by: Patrick McHardy <kaber@trash.net>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Unlike IPv4 this code notifies on all cases where mpls routes
are added or removed and it never automatically removes routes.
Avoiding both the userspace confusion that is caused by omitting
route updates and the possibility of a flood of netlink traffic
when an interface goes doew.
For now reserved labels are handled automatically and userspace
is not notified.
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
This change adds two new netlink routing attributes:
RTA_VIA and RTA_NEWDST.
RTA_VIA specifies the specifies the next machine to send a packet to
like RTA_GATEWAY. RTA_VIA differs from RTA_GATEWAY in that it
includes the address family of the address of the next machine to send
a packet to. Currently the MPLS code supports addresses in AF_INET,
AF_INET6 and AF_PACKET. For AF_INET and AF_INET6 the destination mac
address is acquired from the neighbour table. For AF_PACKET the
destination mac_address is specified in the netlink configuration.
I think raw destination mac address support with the family AF_PACKET
will prove useful. There is MPLS-TP which is defined to operate
on machines that do not support internet packets of any flavor. Further
seem to be corner cases where it can be useful. At this point
I don't care much either way.
RTA_NEWDST specifies the destination address to forward the packet
with. MPLS typically changes it's destination address at every hop.
For a swap operation RTA_NEWDST is specified with a length of one label.
For a push operation RTA_NEWDST is specified with two or more labels.
For a pop operation RTA_NEWDST is not specified or equivalently an emtpy
RTAN_NEWDST is specified.
Those new netlink attributes are used to implement handling of rt-netlink
RTM_NEWROUTE, RTM_DELROUTE, and RTM_GETROUTE messages, to maintain the
MPLS label table.
rtm_to_route_config parses a netlink RTM_NEWROUTE or RTM_DELROUTE message,
verify no unhandled attributes or unhandled values are present and sets
up the data structures for mpls_route_add and mpls_route_del.
I did my best to match up with the existing conventions with the caveats
that MPLS addresses are all destination-specific-addresses, and so
don't properly have a scope.
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Reading and writing addresses in network byte order in netlink is
traditional and I see no reason to change that. MPLS is interesting
as effectively it has variabely length addresses (the MPLS label
stack). To represent these variable length addresses in netlink
I use a valid MPLS label stack (complete with stop bit).
This achieves two things: a well defined existing format is used,
and the data can be interpreted without looking at it's length.
Not needed to look at the length to decode the variable length
network representation allows existing userspace functions
such as inet_ntop to be used without needed to change their
prototype.
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
mpls_route_add and mpls_route_del implement the basic logic for adding
and removing Next Hop Label Forwarding Entries from the MPLS input
label map. The addition and subtraction is done in a way that is
consistent with how the existing routing table in Linux are
maintained. Thus all of the work to deal with NLM_F_APPEND,
NLM_F_EXCL, NLM_F_REPLACE, and NLM_F_CREATE.
Cases that are not clearly defined such as changing the interpretation
of the mpls reserved labels is not allowed.
Because it seems like the right thing to do adding an MPLS route without
specifying an input label and allowing the kernel to pick a free label
table entry is supported. The implementation is currently less than optimal
but that can be changed.
As I don't have anything else to test with only ethernet and the loopback
device are the only two device types currently supported for forwarding
MPLS over.
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
This sysctl gives two benefits. By defaulting the table size to 0
mpls even when compiled in and enabled defaults to not forwarding
any packets. This prevents unpleasant surprises for users.
The other benefit is that as mpls labels are allocated locally a dense
table a small dense label table may be used which saves memory and
is extremely simple and efficient to implement.
This sysctl allows userspace to choose the restrictions on the label
table size userspace applications need to cope with.
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
Signed-off-by: David S. Miller <davem@davemloft.net>