Commit graph

471223 commits

Author SHA1 Message Date
Nimrod Andy
b64bf4b7dd net: fec: align rx data buffer size for dma map/unmap
Align allocated rx data buffer size for dma map/unmap, otherwise
kernel print warning when enable DMA_API_DEBUG.

Signed-off-by: Fugang Duan <B38611@freescale.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2014-09-26 16:05:21 -04:00
Nimrod Andy
f88c7ede50 net: fec: remove the ERR006358 workaround for imx6sx enet
Remove the ERR006358 workaround for imx6sx enet since the hw issue
was fixed on the SOC.

Signed-off-by: Fugang Duan <B38611@freescale.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2014-09-26 16:05:21 -04:00
Nimrod Andy
befe821335 net: fec: Add Ftype to BD to distiguish three tx queues for AVB
The current driver loss Ftype field init for BD, which cause tx
queue #1 and #2 cannot work well.

Add Ftype field to BD to distiguish three queues for AVB:
0 -> Best Effort
1 -> ClassA
2 -> ClassB

Signed-off-by: Fugang Duan <B38611@freescale.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2014-09-26 16:05:21 -04:00
Soren Brinkmann
9026968abe Revert "net/macb: add pinctrl consumer support"
This reverts commit 8ef29f8aae.
The driver core already calls pinctrl_get() and claims the default
state. There is no need to replicate this in the driver.
Acked-by: Nicolas Ferre <nicolas.ferre@atmel.com>

Acked-by: Nicolas Ferre <nicolas.ferre@atmel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2014-09-26 15:48:37 -04:00
Eric Dumazet
f4a775d144 net: introduce __skb_header_release()
While profiling TCP stack, I noticed one useless atomic operation
in tcp_sendmsg(), caused by skb_header_release().

It turns out all current skb_header_release() users have a fresh skb,
that no other user can see, so we can avoid one atomic operation.

Introduce __skb_header_release() to clearly document this.

This gave me a 1.5 % improvement on TCP_RR workload.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2014-09-26 15:40:06 -04:00
Fabio Estevam
aebac74493 fec: Remove fec_enet_select_queue()
Sparse complains about fec_enet_select_queue() not being static.

Feedback from David Miller [1] was to remove this function instead of making it
static:

"Please just delete this function.

It's overriding code which does exactly the same thing.

Actually, more precisely, this code is duplicating code in a way that
bypasses many core facilitites of the networking.  For example, this
override means that socket based flow steering, XPS, etc. are all
not happening on these devices.

Without ->ndo_select_queue(), the flow dissector does __netdev_pick_tx
which is exactly what you want to happen."

[1] http://www.spinics.net/lists/netdev/msg297653.html

Signed-off-by: Fabio Estevam <fabio.estevam@freescale.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2014-09-26 15:39:59 -04:00
David S. Miller
57219dc7bf Merge tag 'master-2014-09-16' of git://git.kernel.org/pub/scm/linux/kernel/git/linville/wireless-next
John W. Linville says:

====================
pull request: wireless-next 2014-09-22

Please pull this batch of updates intended for the 3.18 stream...

For the mac80211 bits, Johannes says:

"This time, I have some rate minstrel improvements, support for a very
small feature from CCX that Steinar reverse-engineered, dynamic ACK
timeout support, a number of changes for TDLS, early support for radio
resource measurement and many fixes. Also, I'm changing a number of
places to clear key memory when it's freed and Intel claims copyright
for code they developed."

For the bluetooth bits, Johan says:

"Here are some more patches intended for 3.18. Most of them are cleanups
or fixes for SMP. The only exception is a fix for BR/EDR L2CAP fixed
channels which should now work better together with the L2CAP
information request procedure."

For the iwlwifi bits, Emmanuel says:

"I fix here dvm which was broken by my last pull request. Arik
continues to work on TDLS and Luca solved a few issues in CT-Kill. Eyal
keeps digging into rate scaling code, more to come soon. Besides this,
nothing really special here."

Beyond that, there are the usual big batches of updates to ath9k, b43,
mwifiex, and wil6210 as well as a handful of other bits here and there.
Also, rtlwifi gets some btcoexist attention from Larry.

Please let me know if there are problems!
====================

Had to adjust the wil6210 code to comply with Joe Perches's recent
change in net-next to make the netdev_*() routines return void instead
of 'int'.

Signed-off-by: David S. Miller <davem@davemloft.net>
2014-09-26 15:39:24 -04:00
Vlad Yasevich
40b8fe45d1 macvtap: Fix race between device delete and open.
In macvtap device delete and open calls can race and
this causes a list curruption of the vlan queue_list.

The race intself is triggered by the idr accessors
that located the vlan device.  The device is stored
into and removed from the idr under both an rtnl and
a mutex.  However, when attempting to locate the device
in idr, only a mutex is taken.  As a result, once cpu
perfoming a delete may take an rtnl and wait for the mutex,
while another cput doing an open() will take the idr
mutex first to fetch the device pointer and later take
an rtnl to add a queue for the device which may have
just gotten deleted.

With this patch, we now hold the rtnl for the duration
of the macvtap_open() call thus making sure that
open will not race with delete.

CC: Michael S. Tsirkin <mst@redhat.com>
CC: Jason Wang <jasowang@redhat.com>
Signed-off-by: Vladislav Yasevich <vyasevic@redhat.com>
Acked-by: Jason Wang <jasowang@redhat.com>
Acked-by: Michael S. Tsirkin <mst@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2014-09-26 15:20:26 -04:00
Joe Perches
6ea754eb76 net: Change netdev_<level> logging functions to return void
No caller or macro uses the return value so make all
the functions return void.

Signed-off-by: Joe Perches <joe@perches.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2014-09-26 15:17:17 -04:00
Joe Perches
0c87b29c31 mellanox: Change en_print to return void
No caller or macro uses the return value so make it void.

Signed-off-by: Joe Perches <joe@perches.com>
Acked-By: Amir Vadai <amirv@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2014-09-26 15:17:16 -04:00
David S. Miller
2b07f0d2f6 Merge branch 'qlcnic'
Manish Chopra says:

====================
qlcnic: Bug fixes.

This patch series contains following bug fixes:

* Fixes related to ethtool statistics.
* Fix for flash read related API.

Please apply this series to 'net'.
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
2014-09-26 15:11:37 -04:00
Manish Chopra
3456399b03 qlcnic: Fix ordering of stats in stats buffer.
o When TX queues are not allocated, driver does not fill TX queues stats in the buffer.
  However, it is also not advancing data pointer by TX queue stats length, which would
  misplace all successive stats data in the buffer and will result in mismatch between
  stats strings and it's values.

o Fix this by advancing data pointer by TX queue stats length when
  queues are not allocated.

Signed-off-by: Manish Chopra <manish.chopra@qlogic.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2014-09-26 15:11:31 -04:00
Manish Chopra
6c0fd0df0c qlcnic: Remove __QLCNIC_DEV_UP bit check to read TX queues statistics.
o TX queues stats must be read when queues are allocated regardless
  of interface is up or not.

Signed-off-by: Manish Chopra <manish.chopra@qlogic.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2014-09-26 15:11:31 -04:00
Manish Chopra
c023030198 qlcnic: Fix memory corruption while reading stats using ethtool.
o  Driver is doing memset with zero for total number of stats bytes when
   it has already filled some data in the stats buffer, which can overwrite
   memory area beyond the length of stats buffer.

o  Fix this by initializing stats buffer with zero before filling any data in it.

Signed-off-by: Manish Chopra <manish.chopra@qlogic.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2014-09-26 15:11:31 -04:00
Sony Chacko
4324414f8c qlcnic: Use qlcnic_83xx_flash_read32() API instead of lockless version of the API.
In qlcnic_83xx_setup_idc_parameters() routine use qlcnic_83xx_flash_read32() API
which takes flash lock internally instead of the lockless version
qlcnic_83xx_lockless_flash_read32().

Signed-off-by: Sony Chacko <sony.chacko@qlogic.com>
Signed-off-by: Manish Chopra <manish.chopra@qlogic.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2014-09-26 15:11:31 -04:00
David S. Miller
b4fc1a460f Merge branch 'bpf-next'
Alexei Starovoitov says:

====================
eBPF syscall, verifier, testsuite

v14 -> v15:
- got rid of macros with hidden control flow (suggested by David)
  replaced macro with explicit goto or return and simplified
  where possible (affected patches #9 and #10)
- rebased, retested

v13 -> v14:
- small change to 1st patch to ease 'new userspace with old kernel'
  problem (done similar to perf_copy_attr()) (suggested by Daniel)
- the rest unchanged

v12 -> v13:
- replaced 'foo __user *' pointers with __aligned_u64 (suggested by David)
- added __attribute__((aligned(8)) to 'union bpf_attr' to keep
  constant alignment between patches
- updated manpage and syscall wrappers due to __aligned_u64
- rebased, retested on x64 with 32-bit and 64-bit userspace and on i386,
  build tested on arm32,sparc64

v11 -> v12:
- dropped patch 11 and copied few macros to libbpf.h (suggested by Daniel)
- replaced 'enum bpf_prog_type' with u32 to be safe in compat (.. Andy)
- implemented and tested compat support (not part of this set) (.. Daniel)
- changed 'void *log_buf' to 'char *' (.. Daniel)
- combined struct bpf_work_struct and bpf_prog_info (.. Daniel)
- added better return value explanation to manpage (.. Andy)
- added log_buf/log_size explanation to manpage (.. Andy & Daniel)
- added a lot more info about prog_type and map_type to manpage (.. Andy)
- rebased, tweaked test_stubs

Patches 1-4 establish BPF syscall shell for maps and programs.
Patches 5-10 add verifier step by step
Patch 11 adds test stubs for 'unspec' program type and verifier testsuite
  from user space

Note that patches 1,3,4,7 add commands and attributes to the syscall
while being backwards compatible from each other, which should demonstrate
how other commands can be added in the future.

After this set the programs can be loaded for testing only. They cannot
be attached to any events. Though manpage talks about tracing and sockets,
it will be a subject of future patches.

Please take a look at manpage:

BPF(2)                     Linux Programmer's Manual                    BPF(2)

NAME
       bpf - perform a command on eBPF map or program

SYNOPSIS
       #include <linux/bpf.h>

       int bpf(int cmd, union bpf_attr *attr, unsigned int size);

DESCRIPTION
       bpf()  syscall  is a multiplexor for a range of different operations on
       eBPF  which  can  be  characterized  as  "universal  in-kernel  virtual
       machine".  eBPF  is  similar  to  original  Berkeley  Packet Filter (or
       "classic BPF") used to filter network packets. Both statically  analyze
       the  programs  before  loading  them  into  the  kernel  to ensure that
       programs cannot harm the running system.

       eBPF extends classic BPF in multiple ways including ability to call in-
       kernel  helper  functions  and  access shared data structures like eBPF
       maps.  The programs can be written in a restricted C that  is  compiled
       into  eBPF  bytecode  and executed on the eBPF virtual machine or JITed
       into native instruction set.

   eBPF Design/Architecture
       eBPF maps is a generic storage of different types.   User  process  can
       create  multiple  maps  (with key/value being opaque bytes of data) and
       access them via file descriptor. In parallel eBPF programs  can  access
       maps  from inside the kernel.  It's up to user process and eBPF program
       to decide what they store inside maps.

       eBPF programs are similar to kernel modules. They  are  loaded  by  the
       user  process  and automatically unloaded when process exits. Each eBPF
       program is a safe run-to-completion set of instructions. eBPF  verifier
       statically  determines  that  the  program  terminates  and  is safe to
       execute. During verification the program takes a hold of maps  that  it
       intends to use, so selected maps cannot be removed until the program is
       unloaded. The program can be attached to different events. These events
       can  be packets, tracepoint events and other types in the future. A new
       event triggers execution of the program  which  may  store  information
       about the event in the maps.  Beyond storing data the programs may call
       into in-kernel helper functions which may, for example, dump stack,  do
       trace_printk  or other forms of live kernel debugging. The same program
       can be attached to multiple events. Different programs can  access  the
       same map:
         tracepoint  tracepoint  tracepoint    sk_buff    sk_buff
          event A     event B     event C      on eth0    on eth1
           |             |          |            |          |
           |             |          |            |          |
           --> tracing <--      tracing       socket      socket
                prog_1           prog_2       prog_3      prog_4
                |  |               |            |
             |---  -----|  |-------|           map_3
           map_1       map_2

   Syscall Arguments
       bpf()  syscall  operation  is determined by cmd which can be one of the
       following:

       BPF_MAP_CREATE
              Create a map with given type and attributes and return map FD

       BPF_MAP_LOOKUP_ELEM
              Lookup element by key in a given map and return its value

       BPF_MAP_UPDATE_ELEM
              Create or update element (key/value pair) in a given map

       BPF_MAP_DELETE_ELEM
              Lookup and delete element by key in a given map

       BPF_MAP_GET_NEXT_KEY
              Lookup element by key in a given map  and  return  key  of  next
              element

       BPF_PROG_LOAD
              Verify and load eBPF program

       attr   is a pointer to a union of type bpf_attr as defined below.

       size   is the size of the union.

       union bpf_attr {
           struct { /* anonymous struct used by BPF_MAP_CREATE command */
               __u32             map_type;
               __u32             key_size;    /* size of key in bytes */
               __u32             value_size;  /* size of value in bytes */
               __u32             max_entries; /* max number of entries in a map */
           };

           struct { /* anonymous struct used by BPF_MAP_*_ELEM commands */
               __u32             map_fd;
               __aligned_u64     key;
               union {
                   __aligned_u64 value;
                   __aligned_u64 next_key;
               };
           };

           struct { /* anonymous struct used by BPF_PROG_LOAD command */
               __u32         prog_type;
               __u32         insn_cnt;
               __aligned_u64 insns;     /* 'const struct bpf_insn *' */
               __aligned_u64 license;   /* 'const char *' */
               __u32         log_level; /* verbosity level of eBPF verifier */
               __u32         log_size;  /* size of user buffer */
               __aligned_u64 log_buf;   /* user supplied 'char *' buffer */
           };
       } __attribute__((aligned(8)));

   eBPF maps
       maps  is  a generic storage of different types for sharing data between
       kernel and userspace.

       Any map type has the following attributes:
         . type
         . max number of elements
         . key size in bytes
         . value size in bytes

       The following wrapper functions demonstrate how  this  syscall  can  be
       used  to  access the maps. The functions use the cmd argument to invoke
       different operations.

       BPF_MAP_CREATE
              int bpf_create_map(enum bpf_map_type map_type, int key_size,
                                 int value_size, int max_entries)
              {
                  union bpf_attr attr = {
                      .map_type = map_type,
                      .key_size = key_size,
                      .value_size = value_size,
                      .max_entries = max_entries
                  };

                  return bpf(BPF_MAP_CREATE, &attr, sizeof(attr));
              }
              bpf()  syscall  creates  a  map  of  map_type  type  and   given
              attributes  key_size,  value_size,  max_entries.   On success it
              returns process-local file descriptor. On error, -1 is  returned
              and errno is set to EINVAL or EPERM or ENOMEM.

              The  attributes key_size and value_size will be used by verifier
              during  program  loading  to  check  that  program  is   calling
              bpf_map_*_elem() helper functions with correctly initialized key
              and  that  program  doesn't  access  map  element  value  beyond
              specified  value_size.   For  example,  when map is created with
              key_size = 8 and program does:
              bpf_map_lookup_elem(map_fd, fp - 4)
              such program will be rejected, since in-kernel  helper  function
              bpf_map_lookup_elem(map_fd,  void  *key) expects to read 8 bytes
              from 'key' pointer, but 'fp - 4' starting address will cause out
              of bounds stack access.

              Similarly,  when  map is created with value_size = 1 and program
              does:
              value = bpf_map_lookup_elem(...);
              *(u32 *)value = 1;
              such program will be rejected, since it accesses  value  pointer
              beyond specified 1 byte value_size limit.

              Currently only hash table map_type is supported:
              enum bpf_map_type {
                 BPF_MAP_TYPE_UNSPEC,
                 BPF_MAP_TYPE_HASH,
              };
              map_type  selects  one  of  the available map implementations in
              kernel. For all map_types eBPF programs  access  maps  with  the
              same      bpf_map_lookup_elem()/bpf_map_update_elem()     helper
              functions.

       BPF_MAP_LOOKUP_ELEM
              int bpf_lookup_elem(int fd, void *key, void *value)
              {
                  union bpf_attr attr = {
                      .map_fd = fd,
                      .key = ptr_to_u64(key),
                      .value = ptr_to_u64(value),
                  };

                  return bpf(BPF_MAP_LOOKUP_ELEM, &attr, sizeof(attr));
              }
              bpf() syscall looks up an element with given key in  a  map  fd.
              If  element  is found it returns zero and stores element's value
              into value.  If element is not found  it  returns  -1  and  sets
              errno to ENOENT.

       BPF_MAP_UPDATE_ELEM
              int bpf_update_elem(int fd, void *key, void *value)
              {
                  union bpf_attr attr = {
                      .map_fd = fd,
                      .key = ptr_to_u64(key),
                      .value = ptr_to_u64(value),
                  };

                  return bpf(BPF_MAP_UPDATE_ELEM, &attr, sizeof(attr));
              }
              The  call  creates  or updates element with given key/value in a
              map fd.  On success it returns zero.  On error, -1  is  returned
              and  errno  is set to EINVAL or EPERM or ENOMEM or E2BIG.  E2BIG
              indicates that number of elements in the map reached max_entries
              limit specified at map creation time.

       BPF_MAP_DELETE_ELEM
              int bpf_delete_elem(int fd, void *key)
              {
                  union bpf_attr attr = {
                      .map_fd = fd,
                      .key = ptr_to_u64(key),
                  };

                  return bpf(BPF_MAP_DELETE_ELEM, &attr, sizeof(attr));
              }
              The call deletes an element in a map fd with given key.  Returns
              zero on success. If element is not found it returns -1 and  sets
              errno to ENOENT.

       BPF_MAP_GET_NEXT_KEY
              int bpf_get_next_key(int fd, void *key, void *next_key)
              {
                  union bpf_attr attr = {
                      .map_fd = fd,
                      .key = ptr_to_u64(key),
                      .next_key = ptr_to_u64(next_key),
                  };

                  return bpf(BPF_MAP_GET_NEXT_KEY, &attr, sizeof(attr));
              }
              The  call  looks  up  an  element  by  key in a given map fd and
              returns key of the next element into next_key pointer. If key is
              not  found,  it return zero and returns key of the first element
              into next_key. If key is the last element,  it  returns  -1  and
              sets  errno  to  ENOENT. Other possible errno values are ENOMEM,
              EFAULT, EPERM, EINVAL.  This method can be used to iterate  over
              all elements of the map.

       close(map_fd)
              will  delete  the  map  map_fd.  Exiting process will delete all
              maps automatically.

   eBPF programs
       BPF_PROG_LOAD
              This cmd is used to load eBPF program into the kernel.

              char bpf_log_buf[LOG_BUF_SIZE];

              int bpf_prog_load(enum bpf_prog_type prog_type,
                                const struct bpf_insn *insns, int insn_cnt,
                                const char *license)
              {
                  union bpf_attr attr = {
                      .prog_type = prog_type,
                      .insns = ptr_to_u64(insns),
                      .insn_cnt = insn_cnt,
                      .license = ptr_to_u64(license),
                      .log_buf = ptr_to_u64(bpf_log_buf),
                      .log_size = LOG_BUF_SIZE,
                      .log_level = 1,
                  };

                  return bpf(BPF_PROG_LOAD, &attr, sizeof(attr));
              }
              prog_type is one of the available program types:
              enum bpf_prog_type {
                      BPF_PROG_TYPE_UNSPEC,
                      BPF_PROG_TYPE_SOCKET,
                      BPF_PROG_TYPE_TRACING,
              };
              By picking prog_type program author  selects  a  set  of  helper
              functions callable from eBPF program and corresponding format of
              struct bpf_context (which is  the  data  blob  passed  into  the
              program  as  the  first  argument).   For  example, the programs
              loaded with  prog_type  =  TYPE_TRACING  may  call  bpf_printk()
              helper,  whereas  TYPE_SOCKET  programs  may  not.   The  set of
              functions  available  to  the  programs  under  given  type  may
              increase in the future.

              Currently the set of functions for TYPE_TRACING is:
              bpf_map_lookup_elem(map_fd, void *key)              // lookup key in a map_fd
              bpf_map_update_elem(map_fd, void *key, void *value) // update key/value
              bpf_map_delete_elem(map_fd, void *key)              // delete key in a map_fd
              bpf_ktime_get_ns(void)                              // returns current ktime
              bpf_printk(char *fmt, int fmt_size, ...)            // prints into trace buffer
              bpf_memcmp(void *ptr1, void *ptr2, int size)        // non-faulting memcmp
              bpf_fetch_ptr(void *ptr)    // non-faulting load pointer from any address
              bpf_fetch_u8(void *ptr)     // non-faulting 1 byte load
              bpf_fetch_u16(void *ptr)    // other non-faulting loads
              bpf_fetch_u32(void *ptr)
              bpf_fetch_u64(void *ptr)

              and bpf_context is defined as:
              struct bpf_context {
                  /* argN fields match one to one to arguments passed to trace events */
                  u64 arg1, arg2, arg3, arg4, arg5, arg6;
                  /* return value from kretprobe event or from syscall_exit event */
                  u64 ret;
              };

              The set of helper functions for TYPE_SOCKET is TBD.

              More   program   types   may   be  added  in  the  future.  Like
              BPF_PROG_TYPE_USER_TRACING for unprivileged programs.

              BPF_PROG_TYPE_UNSPEC is used for  testing  only.  Such  programs
              cannot be attached to events.

              insns array of "struct bpf_insn" instructions

              insn_cnt number of instructions in the program

              license  license  string,  which  must be GPL compatible to call
              helper functions marked gpl_only

              log_buf user supplied buffer that in-kernel verifier is using to
              store  verification  log. Log is a multi-line string that should
              be used by program author to understand  how  verifier  came  to
              conclusion  that program is unsafe. The format of the output can
              change at any time as verifier evolves.

              log_size size of user buffer. If size of the buffer is not large
              enough  to store all verifier messages, -1 is returned and errno
              is set to ENOSPC.

              log_level verbosity level of eBPF verifier, where zero means  no
              logs provided

       close(prog_fd)
              will unload eBPF program

       The  maps  are  accesible  from  programs  and  generally  tie  the two
       together.  Programs process various events  (like  tracepoint,  kprobe,
       packets)  and  store  the  data into maps. User space fetches data from
       maps.  Either the same or a different map may be used by user space  as
       configuration space to alter program behavior on the fly.

   Events
       Once an eBPF program is loaded, it can be attached to an event. Various
       kernel subsystems have different ways to do so. For example:

       setsockopt(sock, SOL_SOCKET, SO_ATTACH_BPF, &prog_fd, sizeof(prog_fd));
       will attach the program prog_fd to socket sock which  was  received  by
       prior call to socket().

       ioctl(event_fd, PERF_EVENT_IOC_SET_BPF, prog_fd);
       will  attach  the  program  prog_fd  to  perf  event event_fd which was
       received by prior call to perf_event_open().

       Another way to attach the program to a tracing event is:
       event_fd = open("/sys/kernel/debug/tracing/events/skb/kfree_skb/filter");
       write(event_fd, "bpf-123"); /* where 123 is eBPF program FD */
       /* here program is attached and will be triggered by events */
       close(event_fd); /* to detach from event */

EXAMPLES
       /* eBPF+sockets example:
        * 1. create map with maximum of 2 elements
        * 2. set map[6] = 0 and map[17] = 0
        * 3. load eBPF program that counts number of TCP and UDP packets received
        *    via map[skb->ip->proto]++
        * 4. attach prog_fd to raw socket via setsockopt()
        * 5. print number of received TCP/UDP packets every second
        */
       int main(int ac, char **av)
       {
           int sock, map_fd, prog_fd, key;
           long long value = 0, tcp_cnt, udp_cnt;

           map_fd = bpf_create_map(BPF_MAP_TYPE_HASH, sizeof(key), sizeof(value), 2);
           if (map_fd < 0) {
               printf("failed to create map '%s'\n", strerror(errno));
               /* likely not run as root */
               return 1;
           }

           key = 6; /* ip->proto == tcp */
           assert(bpf_update_elem(map_fd, &key, &value) == 0);

           key = 17; /* ip->proto == udp */
           assert(bpf_update_elem(map_fd, &key, &value) == 0);

           struct bpf_insn prog[] = {
               BPF_MOV64_REG(BPF_REG_6, BPF_REG_1),          /* r6 = r1 */
               BPF_LD_ABS(BPF_B, 14 + 9),                    /* r0 = ip->proto */
               BPF_STX_MEM(BPF_W, BPF_REG_10, BPF_REG_0, -4),/* *(u32 *)(fp - 4) = r0 */
               BPF_MOV64_REG(BPF_REG_2, BPF_REG_10),         /* r2 = fp */
               BPF_ALU64_IMM(BPF_ADD, BPF_REG_2, -4),        /* r2 = r2 - 4 */
               BPF_LD_MAP_FD(BPF_REG_1, map_fd),             /* r1 = map_fd */
               BPF_CALL_FUNC(BPF_FUNC_map_lookup_elem),      /* r0 = map_lookup(r1, r2) */
               BPF_JMP_IMM(BPF_JEQ, BPF_REG_0, 0, 2),        /* if (r0 == 0) goto pc+2 */
               BPF_MOV64_IMM(BPF_REG_1, 1),                  /* r1 = 1 */
               BPF_XADD(BPF_DW, BPF_REG_0, BPF_REG_1, 0, 0), /* lock *(u64 *)r0 += r1 */
               BPF_MOV64_IMM(BPF_REG_0, 0),                  /* r0 = 0 */
               BPF_EXIT_INSN(),                              /* return r0 */
           };
           prog_fd = bpf_prog_load(BPF_PROG_TYPE_SOCKET, prog, sizeof(prog), "GPL");
           assert(prog_fd >= 0);

           sock = open_raw_sock("lo");

           assert(setsockopt(sock, SOL_SOCKET, SO_ATTACH_BPF, &prog_fd,
                             sizeof(prog_fd)) == 0);

           for (;;) {
               key = 6;
               assert(bpf_lookup_elem(map_fd, &key, &tcp_cnt) == 0);
               key = 17;
               assert(bpf_lookup_elem(map_fd, &key, &udp_cnt) == 0);
               printf("TCP %lld UDP %lld packets0, tcp_cnt, udp_cnt);
               sleep(1);
           }

           return 0;
       }

RETURN VALUE
       For a successful call, the return value depends on the operation:

       BPF_MAP_CREATE
              The new file descriptor associated with eBPF map.

       BPF_PROG_LOAD
              The new file descriptor associated with eBPF program.

       All other commands
              Zero.

       On error, -1 is returned, and errno is set appropriately.

ERRORS
       EPERM  bpf() syscall was made without sufficient privilege (without the
              CAP_SYS_ADMIN capability).

       ENOMEM Cannot allocate sufficient memory.

       EBADF  fd is not an open file descriptor

       EFAULT One  of  the  pointers  (  key or value or log_buf or insns ) is
              outside accessible address space.

       EINVAL The value specified in cmd is not recognized by this kernel.

       EINVAL For BPF_MAP_CREATE, either map_type or attributes are invalid.

       EINVAL For BPF_MAP_*_ELEM  commands,  some  of  the  fields  of  "union
              bpf_attr" unused by this command are not set to zero.

       EINVAL For BPF_PROG_LOAD, attempt to load invalid program (unrecognized
              instruction or uses reserved fields or jumps  out  of  range  or
              loop detected or calls unknown function).

       EACCES For BPF_PROG_LOAD, though program has valid instructions, it was
              rejected, since it was  deemed  unsafe  (may  access  disallowed
              memory   region  or  uninitialized  stack/register  or  function
              constraints don't match actual types or misaligned  access).  In
              such case it is recommended to call bpf() again with log_level =
              1 and examine log_buf for specific reason provided by verifier.

       ENOENT For BPF_MAP_LOOKUP_ELEM or BPF_MAP_DELETE_ELEM,  indicates  that
              element with given key was not found.

       E2BIG  program  is  too  large  or a map reached max_entries limit (max
              number of elements).

NOTES
       These commands may be used only by a privileged process (one having the
       CAP_SYS_ADMIN capability).

SEE ALSO
       eBPF    architecture    and    instruction    set   is   explained   in
       Documentation/networking/filter.txt

Linux                             2014-09-16                            BPF(2)
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
2014-09-26 15:05:40 -04:00
Alexei Starovoitov
3c731eba48 bpf: mini eBPF library, test stubs and verifier testsuite
1.
the library includes a trivial set of BPF syscall wrappers:
int bpf_create_map(int key_size, int value_size, int max_entries);
int bpf_update_elem(int fd, void *key, void *value);
int bpf_lookup_elem(int fd, void *key, void *value);
int bpf_delete_elem(int fd, void *key);
int bpf_get_next_key(int fd, void *key, void *next_key);
int bpf_prog_load(enum bpf_prog_type prog_type,
		  const struct sock_filter_int *insns, int insn_len,
		  const char *license);
bpf_prog_load() stores verifier log into global bpf_log_buf[] array

and BPF_*() macros to build instructions

2.
test stubs configure eBPF infra with 'unspec' map and program types.
These are fake types used by user space testsuite only.

3.
verifier tests valid and invalid programs and expects predefined
error log messages from kernel.
40 tests so far.

$ sudo ./test_verifier
 #0 add+sub+mul OK
 #1 unreachable OK
 #2 unreachable2 OK
 #3 out of range jump OK
 #4 out of range jump2 OK
 #5 test1 ld_imm64 OK
 ...

Signed-off-by: Alexei Starovoitov <ast@plumgrid.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2014-09-26 15:05:15 -04:00
Alexei Starovoitov
17a5267067 bpf: verifier (add verifier core)
This patch adds verifier core which simulates execution of every insn and
records the state of registers and program stack. Every branch instruction seen
during simulation is pushed into state stack. When verifier reaches BPF_EXIT,
it pops the state from the stack and continues until it reaches BPF_EXIT again.
For program:
1: bpf_mov r1, xxx
2: if (r1 == 0) goto 5
3: bpf_mov r0, 1
4: goto 6
5: bpf_mov r0, 2
6: bpf_exit
The verifier will walk insns: 1, 2, 3, 4, 6
then it will pop the state recorded at insn#2 and will continue: 5, 6

This way it walks all possible paths through the program and checks all
possible values of registers. While doing so, it checks for:
- invalid instructions
- uninitialized register access
- uninitialized stack access
- misaligned stack access
- out of range stack access
- invalid calling convention
- instruction encoding is not using reserved fields

Kernel subsystem configures the verifier with two callbacks:

- bool (*is_valid_access)(int off, int size, enum bpf_access_type type);
  that provides information to the verifer which fields of 'ctx'
  are accessible (remember 'ctx' is the first argument to eBPF program)

- const struct bpf_func_proto *(*get_func_proto)(enum bpf_func_id func_id);
  returns argument constraints of kernel helper functions that eBPF program
  may call, so that verifier can checks that R1-R5 types match the prototype

More details in Documentation/networking/filter.txt and in kernel/bpf/verifier.c

Signed-off-by: Alexei Starovoitov <ast@plumgrid.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2014-09-26 15:05:15 -04:00
Alexei Starovoitov
475fb78fbf bpf: verifier (add branch/goto checks)
check that control flow graph of eBPF program is a directed acyclic graph

check_cfg() does:
- detect loops
- detect unreachable instructions
- check that program terminates with BPF_EXIT insn
- check that all branches are within program boundary

Signed-off-by: Alexei Starovoitov <ast@plumgrid.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2014-09-26 15:05:15 -04:00
Alexei Starovoitov
0246e64d9a bpf: handle pseudo BPF_LD_IMM64 insn
eBPF programs passed from userspace are using pseudo BPF_LD_IMM64 instructions
to refer to process-local map_fd. Scan the program for such instructions and
if FDs are valid, convert them to 'struct bpf_map' pointers which will be used
by verifier to check access to maps in bpf_map_lookup/update() calls.
If program passes verifier, convert pseudo BPF_LD_IMM64 into generic by dropping
BPF_PSEUDO_MAP_FD flag.

Note that eBPF interpreter is generic and knows nothing about pseudo insns.

Signed-off-by: Alexei Starovoitov <ast@plumgrid.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2014-09-26 15:05:15 -04:00
Alexei Starovoitov
cbd3570086 bpf: verifier (add ability to receive verification log)
add optional attributes for BPF_PROG_LOAD syscall:
union bpf_attr {
    struct {
	...
	__u32         log_level; /* verbosity level of eBPF verifier */
	__u32         log_size;  /* size of user buffer */
	__aligned_u64 log_buf;   /* user supplied 'char *buffer' */
    };
};

when log_level > 0 the verifier will return its verification log in the user
supplied buffer 'log_buf' which can be used by program author to analyze why
verifier rejected given program.

'Understanding eBPF verifier messages' section of Documentation/networking/filter.txt
provides several examples of these messages, like the program:

  BPF_ST_MEM(BPF_DW, BPF_REG_10, -8, 0),
  BPF_MOV64_REG(BPF_REG_2, BPF_REG_10),
  BPF_ALU64_IMM(BPF_ADD, BPF_REG_2, -8),
  BPF_LD_MAP_FD(BPF_REG_1, 0),
  BPF_CALL_FUNC(BPF_FUNC_map_lookup_elem),
  BPF_JMP_IMM(BPF_JEQ, BPF_REG_0, 0, 1),
  BPF_ST_MEM(BPF_DW, BPF_REG_0, 4, 0),
  BPF_EXIT_INSN(),

will be rejected with the following multi-line message in log_buf:

  0: (7a) *(u64 *)(r10 -8) = 0
  1: (bf) r2 = r10
  2: (07) r2 += -8
  3: (b7) r1 = 0
  4: (85) call 1
  5: (15) if r0 == 0x0 goto pc+1
   R0=map_ptr R10=fp
  6: (7a) *(u64 *)(r0 +4) = 0
  misaligned access off 4 size 8

The format of the output can change at any time as verifier evolves.

Signed-off-by: Alexei Starovoitov <ast@plumgrid.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2014-09-26 15:05:15 -04:00
Alexei Starovoitov
51580e798c bpf: verifier (add docs)
this patch adds all of eBPF verfier documentation and empty bpf_check()

The end goal for the verifier is to statically check safety of the program.

Verifier will catch:
- loops
- out of range jumps
- unreachable instructions
- invalid instructions
- uninitialized register access
- uninitialized stack access
- misaligned stack access
- out of range stack access
- invalid calling convention

More details in Documentation/networking/filter.txt

Signed-off-by: Alexei Starovoitov <ast@plumgrid.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2014-09-26 15:05:14 -04:00
Alexei Starovoitov
0a542a86d7 bpf: handle pseudo BPF_CALL insn
in native eBPF programs userspace is using pseudo BPF_CALL instructions
which encode one of 'enum bpf_func_id' inside insn->imm field.
Verifier checks that program using correct function arguments to given func_id.
If all checks passed, kernel needs to fixup BPF_CALL->imm fields by
replacing func_id with in-kernel function pointer.
eBPF interpreter just calls the function.

In-kernel eBPF users continue to use generic BPF_CALL.

Signed-off-by: Alexei Starovoitov <ast@plumgrid.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2014-09-26 15:05:14 -04:00
Alexei Starovoitov
09756af468 bpf: expand BPF syscall with program load/unload
eBPF programs are similar to kernel modules. They are loaded by the user
process and automatically unloaded when process exits. Each eBPF program is
a safe run-to-completion set of instructions. eBPF verifier statically
determines that the program terminates and is safe to execute.

The following syscall wrapper can be used to load the program:
int bpf_prog_load(enum bpf_prog_type prog_type,
                  const struct bpf_insn *insns, int insn_cnt,
                  const char *license)
{
    union bpf_attr attr = {
        .prog_type = prog_type,
        .insns = ptr_to_u64(insns),
        .insn_cnt = insn_cnt,
        .license = ptr_to_u64(license),
    };

    return bpf(BPF_PROG_LOAD, &attr, sizeof(attr));
}
where 'insns' is an array of eBPF instructions and 'license' is a string
that must be GPL compatible to call helper functions marked gpl_only

Upon succesful load the syscall returns prog_fd.
Use close(prog_fd) to unload the program.

User space tests and examples follow in the later patches

Signed-off-by: Alexei Starovoitov <ast@plumgrid.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2014-09-26 15:05:14 -04:00
Alexei Starovoitov
db20fd2b01 bpf: add lookup/update/delete/iterate methods to BPF maps
'maps' is a generic storage of different types for sharing data between kernel
and userspace.

The maps are accessed from user space via BPF syscall, which has commands:

- create a map with given type and attributes
  fd = bpf(BPF_MAP_CREATE, union bpf_attr *attr, u32 size)
  returns fd or negative error

- lookup key in a given map referenced by fd
  err = bpf(BPF_MAP_LOOKUP_ELEM, union bpf_attr *attr, u32 size)
  using attr->map_fd, attr->key, attr->value
  returns zero and stores found elem into value or negative error

- create or update key/value pair in a given map
  err = bpf(BPF_MAP_UPDATE_ELEM, union bpf_attr *attr, u32 size)
  using attr->map_fd, attr->key, attr->value
  returns zero or negative error

- find and delete element by key in a given map
  err = bpf(BPF_MAP_DELETE_ELEM, union bpf_attr *attr, u32 size)
  using attr->map_fd, attr->key

- iterate map elements (based on input key return next_key)
  err = bpf(BPF_MAP_GET_NEXT_KEY, union bpf_attr *attr, u32 size)
  using attr->map_fd, attr->key, attr->next_key

- close(fd) deletes the map

Signed-off-by: Alexei Starovoitov <ast@plumgrid.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2014-09-26 15:05:14 -04:00
Alexei Starovoitov
749730ce42 bpf: enable bpf syscall on x64 and i386
done as separate commit to ease conflict resolution

Signed-off-by: Alexei Starovoitov <ast@plumgrid.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2014-09-26 15:05:14 -04:00
Alexei Starovoitov
99c55f7d47 bpf: introduce BPF syscall and maps
BPF syscall is a multiplexor for a range of different operations on eBPF.
This patch introduces syscall with single command to create a map.
Next patch adds commands to access maps.

'maps' is a generic storage of different types for sharing data between kernel
and userspace.

Userspace example:
/* this syscall wrapper creates a map with given type and attributes
 * and returns map_fd on success.
 * use close(map_fd) to delete the map
 */
int bpf_create_map(enum bpf_map_type map_type, int key_size,
                   int value_size, int max_entries)
{
    union bpf_attr attr = {
        .map_type = map_type,
        .key_size = key_size,
        .value_size = value_size,
        .max_entries = max_entries
    };

    return bpf(BPF_MAP_CREATE, &attr, sizeof(attr));
}

'union bpf_attr' is backwards compatible with future extensions.

More details in Documentation/networking/filter.txt and in manpage

Signed-off-by: Alexei Starovoitov <ast@plumgrid.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2014-09-26 15:05:14 -04:00
Linus Torvalds
c6ff6486e5 Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/dtor/input
Pull input fix from Dmitry Torokhov:
 "A small fixup to i8042 adding Asus X450LCP to the nomux list"

* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/dtor/input:
  Input: i8042 - fix Asus X450LCP touchpad detection
2014-09-26 11:04:31 -07:00
Pablo Neira Ayuso
34666d467c netfilter: bridge: move br_netfilter out of the core
Jesper reported that br_netfilter always registers the hooks since
this is part of the bridge core. This harms performance for people that
don't need this.

This patch modularizes br_netfilter so it can be rmmod'ed, thus,
the hooks can be unregistered. I think the bridge netfilter should have
been a separated module since the beginning, Patrick agreed on that.

Note that this is breaking compatibility for users that expect that
bridge netfilter is going to be available after explicitly 'modprobe
bridge' or via automatic load through brctl.

However, the damage can be easily undone by modprobing br_netfilter.
The bridge core also spots a message to provide a clue to people that
didn't notice that this has been deprecated.

On top of that, the plan is that nftables will not rely on this software
layer, but integrate the connection tracking into the bridge layer to
enable stateful filtering and NAT, which is was bridge netfilter users
seem to require.

This patch still keeps the fake_dst_ops in the bridge core, since this
is required by when the bridge port is initialized. So we can safely
modprobe/rmmod br_netfilter anytime.

Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Acked-by: Florian Westphal <fw@strlen.de>
2014-09-26 18:42:31 +02:00
Pablo Neira Ayuso
7276ca3fa2 netfilter: bridge: nf_bridge_copy_header as static inline in header
Move nf_bridge_copy_header() as static inline in netfilter_bridge.h
header file. This patch prepares the modularization of the br_netfilter
code.

Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2014-09-26 18:42:30 +02:00
Rob Jones
772476df70 net/netfilter/x_tables.c: use __seq_open_private()
Reduce boilerplate code by using __seq_open_private() instead of seq_open()
in xt_match_open() and xt_target_open().

Signed-off-by: Rob Jones <rob.jones@codethink.co.uk>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2014-09-26 18:42:29 +02:00
Linus Torvalds
d2865c7d4e Merge branch 'sched-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull scheduler fixes from Ingo Molnar:
 "A CONFIG_STACK_GROWSUP=y fix, and a hotplug llc CPU mask fix"

* 'sched-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  sched: Fix unreleased llc_shared_mask bit during CPU hotplug
  sched: Fix end_of_stack() and location of stack canary for architectures using CONFIG_STACK_GROWSUP
2014-09-26 08:38:09 -07:00
Linus Torvalds
8207649c41 Merge branch 'akpm' (fixes from Andrew Morton)
Merge fixes from Andrew Morton:
 "9 fixes"

* emailed patches from Andrew Morton <akpm@linux-foundation.org>:
  mm: softdirty: keep bit when zapping file pte
  fs/cachefiles: add missing \n to kerror conversions
  genalloc: fix device node resource counter
  drivers/rtc/rtc-efi.c: add missing module alias
  mm, slab: initialize object alignment on cache creation
  mm: softdirty: addresses before VMAs in PTE holes aren't softdirty
  ocfs2/dlm: do not get resource spinlock if lockres is new
  nilfs2: fix data loss with mmap()
  ocfs2: free vol_label in ocfs2_delete_osb()
2014-09-26 08:11:43 -07:00
Peter Feiner
dbab31aa2c mm: softdirty: keep bit when zapping file pte
This fixes the same bug as b43790eedd ("mm: softdirty: don't forget to
save file map softdiry bit on unmap") and 9aed8614af ("mm/memory.c:
don't forget to set softdirty on file mapped fault") where the return
value of pte_*mksoft_dirty was being ignored.

To be sure that no other pte/pmd "mk" function return values were being
ignored, I annotated the functions in arch/x86/include/asm/pgtable.h
with __must_check and rebuilt.

The userspace effect of this bug is that the softdirty mark might be
lost if a file mapped pte get zapped.

Signed-off-by: Peter Feiner <pfeiner@google.com>
Acked-by: Cyrill Gorcunov <gorcunov@openvz.org>
Cc: Pavel Emelyanov <xemul@parallels.com>
Cc: Jamie Liu <jamieliu@google.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: <stable@vger.kernel.org>	[3.12+]
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-09-26 08:10:35 -07:00
Fabian Frederick
6ff66ac77a fs/cachefiles: add missing \n to kerror conversions
Commit 0227d6abb3 ("fs/cachefiles: replace kerror by pr_err") didn't
include newline featuring in original kerror definition

Signed-off-by: Fabian Frederick <fabf@skynet.be>
Reported-by: David Howells <dhowells@redhat.com>
Acked-by: David Howells <dhowells@redhat.com>
Cc: <stable@vger.kernel.org>	[3.16.x]
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-09-26 08:10:35 -07:00
Vladimir Zapolskiy
6f3aabd183 genalloc: fix device node resource counter
Decrement the np_pool device_node refcount, which was incremented on
the preceding of_parse_phandle() call.

Signed-off-by: Vladimir Zapolskiy <vladimir_zapolskiy@mentor.com>
Cc: Will Deacon <will.deacon@arm.com>
Cc: Olof Johansson <olof@lixom.net>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-09-26 08:10:35 -07:00
Pali Rohár
451ff6d409 drivers/rtc/rtc-efi.c: add missing module alias
Without proper alias kernel module is not loaded for rtc-efi driver.

Signed-off-by: Pali Rohár <pali.rohar@gmail.com>
Cc: dann frazier <dannf@dannf.org>
Cc: Alessandro Zummo <a.zummo@towertech.it>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-09-26 08:10:35 -07:00
David Rientjes
d4a5fca592 mm, slab: initialize object alignment on cache creation
Since commit 4590685546 ("mm/sl[aou]b: Common alignment code"), the
"ralign" automatic variable in __kmem_cache_create() may be used as
uninitialized.

The proper alignment defaults to BYTES_PER_WORD and can be overridden by
SLAB_RED_ZONE or the alignment specified by the caller.

This fixes https://bugzilla.kernel.org/show_bug.cgi?id=85031

Signed-off-by: David Rientjes <rientjes@google.com>
Reported-by: Andrei Elovikov <a.elovikov@gmail.com>
Acked-by: Christoph Lameter <cl@linux.com>
Cc: Pekka Enberg <penberg@kernel.org>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-09-26 08:10:35 -07:00
Peter Feiner
87e6d49a00 mm: softdirty: addresses before VMAs in PTE holes aren't softdirty
In PTE holes that contain VM_SOFTDIRTY VMAs, unmapped addresses before
VM_SOFTDIRTY VMAs are reported as softdirty by /proc/pid/pagemap.  This
bug was introduced in commit 68b5a65248 ("mm: softdirty: respect
VM_SOFTDIRTY in PTE holes").  That commit made /proc/pid/pagemap look at
VM_SOFTDIRTY in PTE holes but neglected to observe the start of VMAs
returned by find_vma.

Tested:
  Wrote a selftest that creates a PMD-sized VMA then unmaps the first
  page and asserts that the page is not softdirty. I'm going to send the
  pagemap selftest in a later commit.

Signed-off-by: Peter Feiner <pfeiner@google.com>
Cc: Cyrill Gorcunov <gorcunov@openvz.org>
Cc: Pavel Emelyanov <xemul@parallels.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
Cc: "Kirill A. Shutemov" <kirill@shutemov.name>
Cc: Jamie Liu <jamieliu@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-09-26 08:10:35 -07:00
Joseph Qi
5760a97c71 ocfs2/dlm: do not get resource spinlock if lockres is new
There is a deadlock case which reported by Guozhonghua:
  https://oss.oracle.com/pipermail/ocfs2-devel/2014-September/010079.html

This case is caused by &res->spinlock and &dlm->master_lock
misordering in different threads.

It was introduced by commit 8d400b81cc ("ocfs2/dlm: Clean up refmap
helpers").  Since lockres is new, it doesn't not require the
&res->spinlock.  So remove it.

Fixes: 8d400b81cc ("ocfs2/dlm: Clean up refmap helpers")
Signed-off-by: Joseph Qi <joseph.qi@huawei.com>
Reviewed-by: joyce.xue <xuejiufei@huawei.com>
Reported-by: Guozhonghua <guozhonghua@h3c.com>
Cc: Joel Becker <jlbec@evilplan.org>
Cc: Mark Fasheh <mfasheh@suse.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-09-26 08:10:34 -07:00
Andreas Rohner
56d7acc792 nilfs2: fix data loss with mmap()
This bug leads to reproducible silent data loss, despite the use of
msync(), sync() and a clean unmount of the file system.  It is easily
reproducible with the following script:

  ----------------[BEGIN SCRIPT]--------------------
  mkfs.nilfs2 -f /dev/sdb
  mount /dev/sdb /mnt

  dd if=/dev/zero bs=1M count=30 of=/mnt/testfile

  umount /mnt
  mount /dev/sdb /mnt
  CHECKSUM_BEFORE="$(md5sum /mnt/testfile)"

  /root/mmaptest/mmaptest /mnt/testfile 30 10 5

  sync
  CHECKSUM_AFTER="$(md5sum /mnt/testfile)"
  umount /mnt
  mount /dev/sdb /mnt
  CHECKSUM_AFTER_REMOUNT="$(md5sum /mnt/testfile)"
  umount /mnt

  echo "BEFORE MMAP:\t$CHECKSUM_BEFORE"
  echo "AFTER MMAP:\t$CHECKSUM_AFTER"
  echo "AFTER REMOUNT:\t$CHECKSUM_AFTER_REMOUNT"
  ----------------[END SCRIPT]--------------------

The mmaptest tool looks something like this (very simplified, with
error checking removed):

  ----------------[BEGIN mmaptest]--------------------
  data = mmap(NULL, file_size - file_offset, PROT_READ | PROT_WRITE,
              MAP_SHARED, fd, file_offset);

  for (i = 0; i < write_count; ++i) {
        memcpy(data + i * 4096, buf, sizeof(buf));
        msync(data, file_size - file_offset, MS_SYNC))
  }
  ----------------[END mmaptest]--------------------

The output of the script looks something like this:

  BEFORE MMAP:    281ed1d5ae50e8419f9b978aab16de83  /mnt/testfile
  AFTER MMAP:     6604a1c31f10780331a6850371b3a313  /mnt/testfile
  AFTER REMOUNT:  281ed1d5ae50e8419f9b978aab16de83  /mnt/testfile

So it is clear, that the changes done using mmap() do not survive a
remount.  This can be reproduced a 100% of the time.  The problem was
introduced in commit 136e8770cd ("nilfs2: fix issue of
nilfs_set_page_dirty() for page at EOF boundary").

If the page was read with mpage_readpage() or mpage_readpages() for
example, then it has no buffers attached to it.  In that case
page_has_buffers(page) in nilfs_set_page_dirty() will be false.
Therefore nilfs_set_file_dirty() is never called and the pages are never
collected and never written to disk.

This patch fixes the problem by also calling nilfs_set_file_dirty() if the
page has no buffers attached to it.

[akpm@linux-foundation.org: s/PAGE_SHIFT/PAGE_CACHE_SHIFT/]
Signed-off-by: Andreas Rohner <andreas.rohner@gmx.net>
Tested-by: Andreas Rohner <andreas.rohner@gmx.net>
Signed-off-by: Ryusuke Konishi <konishi.ryusuke@lab.ntt.co.jp>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-09-26 08:10:34 -07:00
Joseph Qi
f13a568e5a ocfs2: free vol_label in ocfs2_delete_osb()
osb->vol_label is malloced in ocfs2_initialize_super but not freed if
error occurs or during umount, thus causing a memory leak.

Signed-off-by: Joseph Qi <joseph.qi@huawei.com>
Reviewed-by: joyce.xue <xuejiufei@huawei.com>
Cc: Mark Fasheh <mfasheh@suse.com>
Cc: Joel Becker <jlbec@evilplan.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-09-26 08:10:34 -07:00
Markos Chandras
8a574cfa26 MIPS: mcount: Adjust stack pointer for static trace in MIPS32
Every mcount() call in the MIPS 32-bit kernel is done as follows:

[...]
move at, ra
jal _mcount
addiu sp, sp, -8
[...]

but upon returning from the mcount() function, the stack pointer
is not adjusted properly. This is explained in details in 58b69401c7
(MIPS: Function tracer: Fix broken function tracing).

Commit ad8c396936 ("MIPS: Unbreak function tracer for 64-bit kernel.)
fixed the stack manipulation for 64-bit but it didn't fix it completely
for MIPS32.

Signed-off-by: Markos Chandras <markos.chandras@imgtec.com>
Cc: <stable@vger.kernel.org>
Cc: linux-mips@linux-mips.org
Patchwork: https://patchwork.linux-mips.org/patch/7792/
Signed-off-by: Ralf Baechle <ralf@linux-mips.org>
2014-09-26 11:41:17 +02:00
Paul Burton
c8c0da6bdf MIPS: Fix MFC1 & MFHC1 emulation for 64-bit MIPS systems
Commit bbd426f542 "MIPS: Simplify FP context access" modified the
SIFROMREG & SIFROMHREG macros such that they return unsigned rather
than signed 32b integers. I had believed that to be fine, but
inadvertently missed the MFC1 & MFHC1 cases which write to a struct
pt_regs regs element. On MIPS32 this is fine, but on 64 bit those
saved regs' fields are 64 bit wide. Using unsigned values caused the
32 bit value from the FP register to be zero rather than sign extended
as the architecture specifies, causing incorrect emulation of the
MFC1 & MFHc1 instructions. Fix by reintroducing the casts to signed
integers, and therefore the sign extension.

Signed-off-by: Paul Burton <paul.burton@imgtec.com>
Cc: stable@vger.kernel.org # v3.15+
Cc: linux-mips@linux-mips.org
Patchwork: https://patchwork.linux-mips.org/patch/7848/
Signed-off-by: Ralf Baechle <ralf@linux-mips.org>
2014-09-26 11:33:11 +02:00
Steffen Klassert
d61746b2e7 ip_tunnel: Don't allow to add the same tunnel multiple times.
When we try to add an already existing tunnel, we don't return
an error. Instead we continue and call ip_tunnel_update().
This means that we can change existing tunnels by adding
the same tunnel multiple times. It is even possible to change
the tunnel endpoints of the fallback device.

We fix this by returning an error if we try to add an existing
tunnel.

Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2014-09-26 00:41:30 -04:00
Eric Dumazet
4a8e320c92 net: sched: use pinned timers
While using a MQ + NETEM setup, I had confirmation that the default
timer migration ( /proc/sys/kernel/timer_migration ) is killing us.

Installing this on a receiver side of a TCP_STREAM test, (NIC has 8 TX
queues) :

EST="est 1sec 4sec"
for ETH in eth1
do
 tc qd del dev $ETH root 2>/dev/null
 tc qd add dev $ETH root handle 1: mq
 tc qd add dev $ETH parent 1:1 $EST netem limit 70000 delay 6ms
 tc qd add dev $ETH parent 1:2 $EST netem limit 70000 delay 8ms
 tc qd add dev $ETH parent 1:3 $EST netem limit 70000 delay 10ms
 tc qd add dev $ETH parent 1:4 $EST netem limit 70000 delay 12ms
 tc qd add dev $ETH parent 1:5 $EST netem limit 70000 delay 14ms
 tc qd add dev $ETH parent 1:6 $EST netem limit 70000 delay 16ms
 tc qd add dev $ETH parent 1:7 $EST netem limit 80000 delay 18ms
 tc qd add dev $ETH parent 1:8 $EST netem limit 90000 delay 20ms
done

We can see that timers get migrated into a single cpu, presumably idle
at the time timers are set up.
Then all qdisc dequeues run from this cpu and huge lock contention
happens. This single cpu is stuck in softirq mode and cannot dequeue
fast enough.

    39.24%  [kernel]          [k] _raw_spin_lock
     2.65%  [kernel]          [k] netem_enqueue
     1.80%  [kernel]          [k] netem_dequeue
     1.63%  [kernel]          [k] copy_user_enhanced_fast_string
     1.45%  [kernel]          [k] _raw_spin_lock_bh

By pinning qdisc timers on the cpu running the qdisc, we respect proper
XPS setting and remove this lock contention.

     5.84%  [kernel]          [k] netem_enqueue
     4.83%  [kernel]          [k] _raw_spin_lock
     2.92%  [kernel]          [k] copy_user_enhanced_fast_string

Current Qdiscs that benefit from this change are :

	netem, cbq, fq, hfsc, tbf, htb.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2014-09-26 00:26:48 -04:00
David S. Miller
9fb426a642 Merge branch 'gso_send_check'
Tom Herbert says:

====================
net: Eliminate gso_send_check

gso_send_check presents a lot of complexity for what it is being used
for. It seems that there are only two cases where it might be effective:
TCP and UFO paths. In these cases, the gso_send_check function
initializes the TCP or UDP checksum respectively to the pseudo header
checksum so that the checksum computation is appropriately offloaded or
computed in the gso_segment functions. The gso_send_check functions
are only called from dev.c in skb_mac_gso_segment when ip_summed !=
CHECKSUM_PARTIAL (which seems very unlikely in TCP case). We can move
the logic of this into the respective gso_segment functions where the
checksum is initialized if ip_summed != CHECKSUM_PARTIAL.

With the above cases handled, gso_send_check is no longer needed, so
we can remove all uses of it and the fields in the offload callbacks.
With this change, ip_summed in the skb should be preserved though all
the layers of gso_segment calls.

In follow-on patches, we may be able to remove the check setup code in
tcp_gso_segment if we can guarantee that ip_summed will always be
CHECKSUM_PARTIAL (verify all paths and probably add an assert in
tcp_gro_segment).

Tested these patches by:
  - netperf TCP_STREAM test with GSO enabled
  - Forced ip_summed != CHECKSUM_PARTIAL with above
  - Ran UDP_RR with 10000 request size over GRE tunnel. This exercised
    UFO path.
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
2014-09-26 00:23:13 -04:00
Tom Herbert
53e5039896 net: Remove gso_send_check as an offload callback
The send_check logic was only interesting in cases of TCP offload and
UDP UFO where the checksum needed to be initialized to the pseudo
header checksum. Now we've moved that logic into the related
gso_segment functions so gso_send_check is no longer needed.

Signed-off-by: Tom Herbert <therbert@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2014-09-26 00:22:47 -04:00
Tom Herbert
f71470b37e udp: move logic out of udp[46]_ufo_send_check
In udp[46]_ufo_send_check the UDP checksum initialized to the pseudo
header checksum. We can move this logic into udp[46]_ufo_fragment.
After this change udp[64]_ufo_send_check is a no-op.

Signed-off-by: Tom Herbert <therbert@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2014-09-26 00:22:46 -04:00
Tom Herbert
d020f8f733 tcp: move logic out of tcp_v[64]_gso_send_check
In tcp_v[46]_gso_send_check the TCP checksum is initialized to the
pseudo header checksum using __tcp_v[46]_send_check. We can move this
logic into new tcp[46]_gso_segment functions to be done when
ip_summed != CHECKSUM_PARTIAL (ip_summed == CHECKSUM_PARTIAL should be
the common case, possibly always true when taking GSO path). After this
change tcp_v[46]_gso_send_check is no-op.

Signed-off-by: Tom Herbert <therbert@google.com>
Acked-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2014-09-26 00:22:46 -04:00