Commit graph

19721 commits

Author SHA1 Message Date
Peter Zijlstra
40767b0dc7 sched/deadline: Fix deadline parameter modification handling
Commit 67dfa1b756 ("sched/deadline: Implement cancel_dl_timer() to
use in switched_from_dl()") removed the hrtimer_try_cancel() function
call out from init_dl_task_timer(), which gets called from
__setparam_dl().

The result is that we can now re-init the timer while its active --
this is bad and corrupts timer state.

Furthermore; changing the parameters of an active deadline task is
tricky in that you want to maintain guarantees, while immediately
effective change would allow one to circumvent the CBS guarantees --
this too is bad, as one (bad) task should not be able to affect the
others.

Rework things to avoid both problems. We only need to initialize the
timer once, so move that to __sched_fork() for new tasks.

Then make sure __setparam_dl() doesn't affect the current running
state but only updates the parameters used to calculate the next
scheduling period -- this guarantees the CBS functions as expected
(albeit slightly pessimistic).

This however means we need to make sure __dl_clear_params() needs to
reset the active state otherwise new (and tasks flipping between
classes) will not properly (re)compute their first instance.

Todo: close class flipping CBS hole.
Todo: implement delayed BW release.

Reported-by: Luca Abeni <luca.abeni@unitn.it>
Acked-by: Juri Lelli <juri.lelli@arm.com>
Tested-by: Luca Abeni <luca.abeni@unitn.it>
Fixes: 67dfa1b756 ("sched/deadline: Implement cancel_dl_timer() to use in switched_from_dl()")
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: <stable@vger.kernel.org>
Cc: Kirill Tkhai <tkhai@yandex.ru>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Link: http://lkml.kernel.org/r/20150128140803.GF23038@twins.programming.kicks-ass.net
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2015-02-04 07:42:48 +01:00
Wonhong Kwon
a64fc82c4f PM / hibernate: exclude freed pages from allocated pages printout
hibernate_preallocate_memory() prints out that how many pages are
allocated, but it doesn't take into consideration the pages freed by
free_unnecessary_pages(). Therefore, it always shows the count more
than actually allocated.

Signed-off-by: Wonhong Kwon <wonhong.kwon@lge.com>
[ rjw: Subject ]
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
2015-02-03 22:53:53 +01:00
Linus Torvalds
00845eb968 sched: don't cause task state changes in nested sleep debugging
Commit 8eb23b9f35 ("sched: Debug nested sleeps") added code to report
on nested sleep conditions, which we generally want to avoid because the
inner sleeping operation can re-set the thread state to TASK_RUNNING,
but that will then cause the outer sleep loop not actually sleep when it
calls schedule.

However, that's actually valid traditional behavior, with the inner
sleep being some fairly rare case (like taking a sleeping lock that
normally doesn't actually need to sleep).

And the debug code would actually change the state of the task to
TASK_RUNNING internally, which makes that kind of traditional and
working code not work at all, because now the nested sleep doesn't just
sometimes cause the outer one to not block, but will cause it to happen
every time.

In particular, it will cause the cardbus kernel daemon (pccardd) to
basically busy-loop doing scheduling, converting a laptop into a heater,
as reported by Bruno Prémont.  But there may be other legacy uses of
that nested sleep model in other drivers that are also likely to never
get converted to the new model.

This fixes both cases:

 - don't set TASK_RUNNING when the nested condition happens (note: even
   if WARN_ONCE() only _warns_ once, the return value isn't whether the
   warning happened, but whether the condition for the warning was true.
   So despite the warning only happening once, the "if (WARN_ON(..))"
   would trigger for every nested sleep.

 - in the cases where we knowingly disable the warning by using
   "sched_annotate_sleep()", don't change the task state (that is used
   for all core scheduling decisions), instead use '->task_state_change'
   that is used for the debugging decision itself.

(Credit for the second part of the fix goes to Oleg Nesterov: "Can't we
avoid this subtle change in behaviour DEBUG_ATOMIC_SLEEP adds?" with the
suggested change to use 'task_state_change' as part of the test)

Reported-and-bisected-by: Bruno Prémont <bonbons@linux-vserver.org>
Tested-by: Rafael J Wysocki <rjw@rjwysocki.net>
Acked-by: Oleg Nesterov <oleg@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>,
Cc: Ilya Dryomov <ilya.dryomov@inktank.com>,
Cc: Mike Galbraith <umgwanakikbuti@gmail.com>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Peter Hurley <peter@hurleysoftware.com>,
Cc: Davidlohr Bueso <dave@stgolabs.net>,
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2015-02-01 12:23:32 -08:00
Linus Torvalds
6155bc1431 Merge branch 'perf-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull perf fixes from Ingo Molnar:
 "Mostly tooling fixes, but also an event groups fix, two PMU driver
  fixes and a CPU model variant addition"

* 'perf-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  perf: Tighten (and fix) the grouping condition
  perf/x86/intel: Add model number for Airmont
  perf/rapl: Fix crash in rapl_scale()
  perf/x86/intel/uncore: Move uncore_box_init() out of driver initialization
  perf probe: Fix probing kretprobes
  perf symbols: Introduce 'for' method to iterate over the symbols with a given name
  perf probe: Do not rely on map__load() filter to find symbols
  perf symbols: Introduce method to iterate symbols ordered by name
  perf symbols: Return the first entry with a given name in find_by_name method
  perf annotate: Fix memory leaks in LOCK handling
  perf annotate: Handle ins parsing failures
  perf scripting perl: Force to use stdbool
  perf evlist: Remove extraneous 'was' on error message
2015-01-30 14:34:55 -08:00
Xunlei Pang
16b269436b sched/deadline: Modify cpudl::free_cpus to reflect rd->online
Currently, cpudl::free_cpus contains all CPUs during init, see
cpudl_init(). When calling cpudl_find(), we have to add rd->span
to avoid selecting the cpu outside the current root domain, because
cpus_allowed cannot be depended on when performing clustered
scheduling using the cpuset, see find_later_rq().

This patch adds cpudl_set_freecpu() and cpudl_clear_freecpu() for
changing cpudl::free_cpus when doing rq_online_dl()/rq_offline_dl(),
so we can avoid the rd->span operation when calling cpudl_find()
in find_later_rq().

Signed-off-by: Xunlei Pang <pang.xunlei@linaro.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Juri Lelli <juri.lelli@gmail.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Link: http://lkml.kernel.org/r/1421642980-10045-1-git-send-email-pang.xunlei@linaro.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2015-01-30 19:39:16 +01:00
Preeti U Murthy
ff6f2d29bd sched/idle: Add missing checks to the exit condition of cpu_idle_poll()
cpu_idle_poll() is entered into when either the cpu_idle_force_poll is set or
tick_check_broadcast_expired() returns true. The exit condition from
cpu_idle_poll() is tif_need_resched().

However this does not take into account scenarios where cpu_idle_force_poll
changes or tick_check_broadcast_expired() returns false, without setting
the resched flag. So a cpu will be caught in cpu_idle_poll() needlessly,
thereby wasting power. Add an explicit check on cpu_idle_force_poll and
tick_check_broadcast_expired() to the exit condition of cpu_idle_poll()
to avoid this.

Signed-off-by: Preeti U Murthy <preeti@linux.vnet.ibm.com>
Reviewed-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: linuxppc-dev@lists.ozlabs.org
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Link: http://lkml.kernel.org/r/20150121105655.15279.59626.stgit@preeti.in.ibm.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2015-01-30 19:38:52 +01:00
Frederic Weisbecker
a18b5d0181 sched: Fix missing preemption opportunity
If an interrupt fires in cond_resched(), between the call to __schedule()
and the PREEMPT_ACTIVE count decrementation, and that interrupt sets
TIF_NEED_RESCHED, the call to preempt_schedule_irq() will be ignored
due to the PREEMPT_ACTIVE count. This kind of scenario, with irq preemption
being delayed because it's interrupting a preempt-disabled area, is
usually fixed up after preemption is re-enabled back with an explicit
call to preempt_schedule().

This is what preempt_enable() does but a raw preempt count decrement as
performed by __preempt_count_sub(PREEMPT_ACTIVE) doesn't handle delayed
preemption check. Therefore when such a race happens, the rescheduling
is going to be delayed until the next scheduler or preemption entrypoint.
This can be a problem for scheduler latency sensitive workloads.

Lets fix that by consolidating cond_resched() with preempt_schedule()
internals.

Reported-by: Linus Torvalds <torvalds@linux-foundation.org>
Reported-by: Ingo Molnar <mingo@kernel.org>
Original-patch-by: Ingo Molnar <mingo@kernel.org>
Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: http://lkml.kernel.org/r/1421946484-9298-1-git-send-email-fweisbec@gmail.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2015-01-30 19:38:51 +01:00
Tim Chen
80e3d87b2c sched/rt: Reduce rq lock contention by eliminating locking of non-feasible target
This patch adds checks that prevens futile attempts to move rt tasks
to a CPU with active tasks of equal or higher priority.

This reduces run queue lock contention and improves the performance of
a well known OLTP benchmark by 0.7%.

Signed-off-by: Tim Chen <tim.c.chen@linux.intel.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Shawn Bohrer <sbohrer@rgmadvisors.com>
Cc: Suruchi Kadu <suruchi.a.kadu@intel.com>
Cc: Doug Nelson<doug.nelson@intel.com>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Link: http://lkml.kernel.org/r/1421430374.2399.27.camel@schen9-desk2.jf.intel.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2015-01-30 19:38:49 +01:00
Ingo Molnar
3847b27224 Merge branch 'sched/urgent' into sched/core
Merge all pending fixes and refresh the tree, before applying new changes.

Signed-off-by: Ingo Molnar <mingo@kernel.org>
2015-01-30 19:28:36 +01:00
Todd E Brandt
c9257f78b4 PM / sleep: export suspend_resume trace event
Export the suspend_resume tracepoint so it can be used
in loadable modules.

Signed-off-by: Todd Brandt <todd.e.brandt@linux.intel.com>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
2015-01-30 02:10:41 +01:00
Ingo Molnar
f10698ed68 Merge branch 'perf/urgent' into perf/core, to pick up fixes
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2015-01-28 15:42:56 +01:00
Mike Galbraith
bb2bc55a69 sched: Fix crash if cpuset_cpumask_can_shrink() is passed an empty cpumask
While creating an exclusive cpuset, we passed cpuset_cpumask_can_shrink()
an empty cpumask (cur), and dl_bw_of(cpumask_any(cur)) made boom with it:

 CPU: 0 PID: 6942 Comm: shield.sh Not tainted 3.19.0-master #19
 Hardware name: MEDIONPC MS-7502/MS-7502, BIOS 6.00 PG 12/26/2007
 task: ffff880224552450 ti: ffff8800caab8000 task.ti: ffff8800caab8000
 RIP: 0010:[<ffffffff81073846>]  [<ffffffff81073846>] cpuset_cpumask_can_shrink+0x56/0xb0
 [...]
 Call Trace:
  [<ffffffff810cb82a>] validate_change+0x18a/0x200
  [<ffffffff810cc877>] cpuset_write_resmask+0x3b7/0x720
  [<ffffffff810c4d58>] cgroup_file_write+0x38/0x100
  [<ffffffff811d953a>] kernfs_fop_write+0x12a/0x180
  [<ffffffff8116e1a3>] vfs_write+0xb3/0x1d0
  [<ffffffff8116ed06>] SyS_write+0x46/0xb0
  [<ffffffff8159ced6>] system_call_fastpath+0x16/0x1b

Signed-off-by: Mike Galbraith <umgwanakikbuti@gmail.com>
Acked-by: Zefan Li <lizefan@huawei.com>
Fixes: f82f80426f ("sched/deadline: Ensure that updates to exclusive cpusets don't break AC")
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Link: http://lkml.kernel.org/r/1422417235.5716.5.camel@marge.simpson.net
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2015-01-28 15:28:15 +01:00
Peter Zijlstra
c3c87e7704 perf: Tighten (and fix) the grouping condition
The fix from 9fc81d8742 ("perf: Fix events installation during
moving group") was incomplete in that it failed to recognise that
creating a group with events for different CPUs is semantically
broken -- they cannot be co-scheduled.

Furthermore, it leads to real breakage where, when we create an event
for CPU Y and then migrate it to form a group on CPU X, the code gets
confused where the counter is programmed -- triggered in practice
as well by me via the perf fuzzer.

Fix this by tightening the rules for creating groups. Only allow
grouping of counters that can be co-scheduled in the same context.
This means for the same task and/or the same cpu.

Fixes: 9fc81d8742 ("perf: Fix events installation during moving group")
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Arnaldo Carvalho de Melo <acme@kernel.org>
Cc: Jiri Olsa <jolsa@redhat.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Link: http://lkml.kernel.org/r/20150123125834.090683288@infradead.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2015-01-28 13:17:35 +01:00
Jan Beulich
81907478c4 sched/fair: Avoid using uninitialized variable in preferred_group_nid()
At least some gcc versions - validly afaict - warn about potentially
using max_group uninitialized: There's no way the compiler can prove
that the body of the conditional where it and max_faults get set/
updated gets executed; in fact, without knowing all the details of
other scheduler code, I can't prove this either.

Generally the necessary change would appear to be to clear max_group
prior to entering the inner loop, and break out of the outer loop when
it ends up being all clear after the inner one. This, however, seems
inefficient, and afaict the same effect can be achieved by exiting the
outer loop when max_faults is still zero after the inner loop.

[ mingo: changed the solution to zero initialization: uninitialized_var()
  needs to die, as it's an actively dangerous construct: if in the future
  a known-proven-good piece of code is changed to have a true, buggy
  uninitialized variable, the compiler warning is then supressed...

  The better long term solution is to clean up the code flow, so that
  even simple minded compilers (and humans!) are able to read it without
  getting a headache.  ]

Signed-off-by: Jan Beulich <jbeulich@suse.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Rik van Riel <riel@redhat.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Link: http://lkml.kernel.org/r/54C2139202000078000588F7@mail.emea.novell.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2015-01-28 13:14:12 +01:00
David S. Miller
95f873f2ff Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net
Conflicts:
	arch/arm/boot/dts/imx6sx-sdb.dts
	net/sched/cls_bpf.c

Two simple sets of overlapping changes.

Signed-off-by: David S. Miller <davem@davemloft.net>
2015-01-27 16:59:56 -08:00
Linus Torvalds
59343cd7c4 Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net
Pull networking fixes from David Miller:

 1) Don't OOPS on socket AIO, from Christoph Hellwig.

 2) Scheduled scans should be aborted upon RFKILL, from Emmanuel
    Grumbach.

 3) Fix sleep in atomic context in kvaser_usb, from Ahmed S Darwish.

 4) Fix RCU locking across copy_to_user() in bpf code, from Alexei
    Starovoitov.

 5) Lots of crash, memory leak, short TX packet et al bug fixes in
    sh_eth from Ben Hutchings.

 6) Fix memory corruption in SCTP wrt.  INIT collitions, from Daniel
    Borkmann.

 7) Fix return value logic for poll handlers in netxen, enic, and bnx2x.
    From Eric Dumazet and Govindarajulu Varadarajan.

 8) Header length calculation fix in mac80211 from Fred Chou.

 9) mv643xx_eth doesn't handle highmem correctly in non-TSO code paths.
    From Ezequiel Garcia.

10) udp_diag has bogus logic in it's hash chain skipping, copy same fix
    tcp diag used.  From Herbert Xu.

11) amd-xgbe programs wrong rx flow control register, from Thomas
    Lendacky.

12) Fix race leading to use after free in ping receive path, from Subash
    Abhinov Kasiviswanathan.

13) Cache redirect routes otherwise we can get a heavy backlog of rcu
    jobs liberating DST_NOCACHE entries.  From Hannes Frederic Sowa.

* git://git.kernel.org/pub/scm/linux/kernel/git/davem/net: (48 commits)
  net: don't OOPS on socket aio
  stmmac: prevent probe drivers to crash kernel
  bnx2x: fix napi poll return value for repoll
  ipv6: replacing a rt6_info needs to purge possible propagated rt6_infos too
  sh_eth: Fix DMA-API usage for RX buffers
  sh_eth: Check for DMA mapping errors on transmit
  sh_eth: Ensure DMA engines are stopped before freeing buffers
  sh_eth: Remove RX overflow log messages
  ping: Fix race in free in receive path
  udp_diag: Fix socket skipping within chain
  can: kvaser_usb: Fix state handling upon BUS_ERROR events
  can: kvaser_usb: Retry the first bulk transfer on -ETIMEDOUT
  can: kvaser_usb: Send correct context to URB completion
  can: kvaser_usb: Do not sleep in atomic context
  ipv4: try to cache dst_entries which would cause a redirect
  samples: bpf: relax test_maps check
  bpf: rcu lock must not be held when calling copy_to_user()
  net: sctp: fix slab corruption from use after free on INIT collisions
  net: mv643xx_eth: Fix highmem support in non-TSO egress path
  sh_eth: Fix serialisation of interrupt disable with interrupt & NAPI handlers
  ...
2015-01-27 13:55:36 -08:00
Alexei Starovoitov
8ebe667c41 bpf: rcu lock must not be held when calling copy_to_user()
BUG: sleeping function called from invalid context at mm/memory.c:3732
in_atomic(): 0, irqs_disabled(): 0, pid: 671, name: test_maps
1 lock held by test_maps/671:
 #0:  (rcu_read_lock){......}, at: [<0000000000264190>] map_lookup_elem+0xe8/0x260
Call Trace:
([<0000000000115b7e>] show_trace+0x12e/0x150)
 [<0000000000115c40>] show_stack+0xa0/0x100
 [<00000000009b163c>] dump_stack+0x74/0xc8
 [<000000000017424a>] ___might_sleep+0x23a/0x248
 [<00000000002b58e8>] might_fault+0x70/0xe8
 [<0000000000264230>] map_lookup_elem+0x188/0x260
 [<0000000000264716>] SyS_bpf+0x20e/0x840

Fix it by allocating temporary buffer to store map element value.

Fixes: db20fd2b01 ("bpf: add lookup/update/delete/iterate methods to BPF maps")
Reported-by: Michael Holzheu <holzheu@linux.vnet.ibm.com>
Signed-off-by: Alexei Starovoitov <ast@plumgrid.com>
Acked-by: Daniel Borkmann <dborkman@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-01-26 17:20:40 -08:00
Linus Torvalds
c976a67b02 Merge branch 'for-3.19-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup
Pull cgroup fix from Tejun Heo:
 "The lifetime rules of cgroup hierarchies always have been somewhat
  counter-intuitive and cgroup core tried to enforce that hierarchies
  w/o userland-visible usages must die in finite amount of time so that
  the controllers can be reused for other hierarchies; unfortunately,
  this can't be implemented reasonably for the memory controller - the
  kmemcg part doesn't have any way to forcefully drain the existing
  usages, leading to an interruptible hang if a following mount attempts
  to use the controller in any way.

  So, it seems like we're stuck with "hierarchies live on till they die
  whenever that may be" at least for now.  This pretty much confines
  attaching controllers to hierarchies to before the hierarchies are
  actively used by making dynamic configurations post active usages
  unreliable.  This has never been reliable and should be fine in
  practice given how cgroups are used.

  After the patch, hierarchies aren't killed if it isn't already
  drained.  A following mount attempt of the same mount options will
  reuse the existing hierarchy.  Mount attempts with differing options
  will fail w/ -EBUSY"

* 'for-3.19-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup:
  cgroup: prevent mount hang due to memory controller lifetime
2015-01-26 15:17:34 -08:00
Borislav Petkov
edb0ec0725 kexec, Kconfig: spell "architecture" properly
Grepping for "archicture" showed it actually twice! Most unusual
spelling error, very interesting. :)

Signed-off-by: Borislav Petkov <bp@suse.de>
Signed-off-by: Jiri Kosina <jkosina@suse.cz>
2015-01-26 14:36:46 +01:00
Linus Torvalds
14746306af Merge branch 'x86-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull x86 fixes from Thomas Gleixner:
 "Hopefully the last round of fixes for 3.19

   - regression fix for the LDT changes
   - regression fix for XEN interrupt handling caused by the APIC
     changes
   - regression fixes for the PAT changes
   - last minute fixes for new the MPX support
   - regression fix for 32bit UP
   - fix for a long standing relocation issue on 64bit tagged for stable
   - functional fix for the Hyper-V clocksource tagged for stable
   - downgrade of a pr_err which tends to confuse users

  Looks a bit on the large side, but almost half of it are valuable
  comments"

* 'x86-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  x86/tsc: Change Fast TSC calibration failed from error to info
  x86/apic: Re-enable PCI_MSI support for non-SMP X86_32
  x86, mm: Change cachemode exports to non-gpl
  x86, tls: Interpret an all-zero struct user_desc as "no segment"
  x86, tls, ldt: Stop checking lm in LDT_empty
  x86, mpx: Strictly enforce empty prctl() args
  x86, mpx: Fix potential performance issue on unmaps
  x86, mpx: Explicitly disable 32-bit MPX support on 64-bit kernels
  x86, hyperv: Mark the Hyper-V clocksource as being continuous
  x86: Don't rely on VMWare emulating PAT MSR correctly
  x86, irq: Properly tag virtualization entry in /proc/interrupts
  x86, boot: Skip relocs when load address unchanged
  x86/xen: Override ACPI IRQ management callback __acpi_unregister_gsi
  ACPI: pci: Do not clear pci_dev->irq in acpi_pci_irq_disable()
  x86/xen: Treat SCI interrupt as normal GSI interrupt
2015-01-25 18:11:17 -08:00
Linus Torvalds
b73f0c8f4b Merge branch 'timers-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull timer fixes from Thomas Gleixner:
 "A set of small fixes:

   - regression fix for exynos_mct clocksource

   - trivial build fix for kona clocksource

   - functional one liner fix for the sh_tmu clocksource

   - two validation fixes to prevent (root only) data corruption in the
     kernel via settimeofday and adjtimex.  Tagged for stable"

* 'timers-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  time: adjtimex: Validate the ADJ_FREQUENCY values
  time: settimeofday: Validate the values of tv from user
  clocksource: sh_tmu: Set cpu_possible_mask to fix SMP broadcast
  clocksource: kona: fix __iomem annotation
  clocksource: exynos_mct: Fix bitmask regression for exynos4_mct_write
2015-01-25 17:47:34 -08:00
kbuild test robot
4ebbda5251 hrtimer: Make __hrtimer_get_next_event() static
kernel/time/hrtimer.c:444:9: sparse: symbol '__hrtimer_get_next_event' was not declared. Should it be static?

Fixes: 9bc7491906 hrtimer: Prevent stale expiry time in hrtimer_interrupt()
Signed-off-by: Fengguang Wu <fengguang.wu@intel.com>
Cc: kbuild-all@01.org
Link: http://lkml.kernel.org/r/20150123121206.GA4766@snb
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2015-01-24 10:53:36 +01:00
Thomas Gleixner
fe31fca35d Couple of items for 3.20
* ktime division optimization
 * Expose a few more y2038-safe timekeeping interfaces
 * RTC core changes to address y2038
 
 Signed-off-by: John Stultz <john.stultz@linaro.org>
 -----BEGIN PGP SIGNATURE-----
 Version: GnuPG v1
 
 iQEcBAABAgAGBQJUwvXJAAoJEK8vClot3jMxTAoH/1DMT3fuVx6RFjKJ/P1abIB+
 +w3cfEgEWgkSwYmuS0XHq1WppnQ0p0n1GOJcWUPiP9tTGrKcTdp5uG5qMprcga3q
 XoeR8wefkyEKyH4ukStdGKQKot2Vj117TauDtVNPf2eOOBS5pqOw1dYUlwjlMtOj
 45poW5ORNKmBMn90e22k8nlNSI9PebvMh9w6nzeYJWEibdyk96z2TOk1puPTvws/
 ppyNzlhnKckpNb49JVxE8B4DNRpXsUV+aUxRNyRPN4OdqCGzHwIJCyEKi6+nbRyb
 4HMUhfl8eRB2Iu7zHF2a2XEOqJdOjl8i1DsTwr3Vwd3crf4XkXD6WtTtGl2YKkU=
 =YhDu
 -----END PGP SIGNATURE-----

Merge tag 'fortglx-3.20-time' of https://git.linaro.org/people/john.stultz/linux into timers/core

Pull time updates from John Stultz for 3.20:

 * ktime division optimization
 * Expose a few more y2038-safe timekeeping interfaces
 * RTC core changes to address y2038
2015-01-24 10:11:12 +01:00
Xunlei Pang
9a4a445e30 rtc: Convert rtc_set_ntp_time() to use timespec64
rtc_set_ntp_time() uses timespec which is y2038-unsafe,
so modify to use timespec64 which is y2038-safe, then
replace rtc_time_to_tm() with rtc_time64_to_tm().

Also adjust all its call sites(only NTP uses it) accordingly.

Cc: pang.xunlei <pang.xunlei@linaro.org>
Cc: Arnd Bergmann <arnd.bergmann@linaro.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Ingo Molnar <mingo@kernel.org>
Signed-off-by: Xunlei Pang <pang.xunlei@linaro.org>
Signed-off-by: John Stultz <john.stultz@linaro.org>
2015-01-23 17:21:57 -08:00
John Stultz
d08c0cdd26 time: Expose getboottime64 for in-kernel uses
Adds a timespec64 based getboottime64() implementation
that can be used as we convert internal users of
getboottime away from using timespecs.

Cc: pang.xunlei <pang.xunlei@linaro.org>
Cc: Arnd Bergmann <arnd.bergmann@linaro.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Ingo Molnar <mingo@kernel.org>
Signed-off-by: John Stultz <john.stultz@linaro.org>
2015-01-23 17:21:54 -08:00
Nicolas Pitre
8b618628b2 ktime: Optimize ktime_divns for constant divisors
At least on ARM, do_div() is optimized to turn constant divisors into
an inline multiplication by the reciprocal value at compile time.
However this optimization is missed entirely whenever ktime_divns() is
used and the slow out-of-line division code is used all the time.

Let ktime_divns() use do_div() inline whenever the divisor is constant
and small enough.  This will make things like ktime_to_us() and
ktime_to_ms() much faster.

Cc: Arnd Bergmann <arnd.bergmann@linaro.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Nicolas Pitre <nico@linaro.org>
Acked-by: Arnd Bergmann <arnd@arndb.de>
Signed-off-by: Nicolas Pitre <nico@linaro.org>
Signed-off-by: John Stultz <john.stultz@linaro.org>
2015-01-23 17:21:31 -08:00
Rickard Strandqvist
d78cb3680c PM / hibernate: Remove unused function
Remove the function get_safe_write_buffer() that is not used anywhere.

This was partially found by using a static code analysis program called cppcheck.

Signed-off-by: Rickard Strandqvist <rickard_strandqvist@spectrumdigital.se>
Acked-by: Pavel Machek <pavel@ucw.cz>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
2015-01-23 23:11:42 +01:00
Nishanth Menon
f5f4eda4c9 PM / QoS: Add debugfs support to view the list of constraints
PM QoS requests are notoriously hard to debug and made even
more so due to their highly dynamic nature. Having visibility
into the internal data representation per constraint allows
us to have much better appreciation of potential issues or
bad usage by drivers in the system.

So introduce for all classes of PM QoS, an entry in
/sys/kernel/debug/pm_qos that shall show all the current
requests as well as the snapshot of the value these requests
boil down to. For example:
==> /sys/kernel/debug/pm_qos/cpu_dma_latency <==
1: 4444: Active
2: 2000000000: Default
3: 2000000000: Default
4: 2000000000: Default
Type=Minimum, Value=4444, Requests: active=1 / total=4

==> /sys/kernel/debug/pm_qos/memory_bandwidth <==
Empty!

...

The actual value listed will have their meaning based
on the QoS it is on, the 'Type' indicates what logic
it would use to collate the information - Minimum,
Maximum, or Sum. Value is the collation of all requests.
This interface also compares the values with the defaults
for the QoS class and marks the ones that are
currently active.

Signed-off-by: Nishanth Menon <nm@ti.com>
Signed-off-by: Dave Gerlach <d-gerlach@ti.com>
Acked-by: Kevin Hilman <khilman@linaro.org>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
2015-01-23 22:16:21 +01:00
Thomas Gleixner
9bc7491906 hrtimer: Prevent stale expiry time in hrtimer_interrupt()
hrtimer_interrupt() has the following subtle issue:

hrtimer_interrupt()
  lock(cpu_base);
  expires_next = KTIME_MAX;

  expire_timers(CLOCK_MONOTONIC);
  expires = get_next_timer(CLOCK_MONOTONIC);
  if (expires < expires_next)
    expires_next = expires;

  expire_timers(CLOCK_REALTIME);
    unlock(cpu_base);
    wakeup()
    hrtimer_start(CLOCK_MONOTONIC, newtimer);
    lock(cpu_base();  
  expires = get_next_timer(CLOCK_REALTIME);
  if (expires < expires_next)
    expires_next = expires;

So because we already evaluated the next expiring timer of
CLOCK_MONOTONIC we ignore that the expiry time of newtimer might be
earlier than the overall next expiry time in hrtimer_interrupt().

To solve this, remove the caching of the next expiry value from
hrtimer_interrupt() and reevaluate all active clock bases for the next
expiry value. To avoid another code duplication, create a shared
evaluation function and use it for hrtimer_get_next_event(),
hrtimer_force_reprogram() and hrtimer_interrupt().

There is another subtlety in this mechanism:

While hrtimer_interrupt() is running, we want to avoid to touch the
hardware device because we will reprogram it anyway at the end of
hrtimer_interrupt(). This works nicely for hrtimers which get rearmed
via the HRTIMER_RESTART mechanism, because we drop out when the
callback on that CPU is running. But that fails, if a new timer gets
enqueued like in the example above.

This has another implication: While hrtimer_interrupt() is running we
refuse remote enqueueing of timers - see hrtimer_interrupt() and
hrtimer_check_target().

hrtimer_interrupt() tries to prevent this by setting cpu_base->expires
to KTIME_MAX, but that fails if a new timer gets queued.

Prevent both the hardware access and the remote enqueue
explicitely. We can loosen the restriction on the remote enqueue now
due to reevaluation of the next expiry value, but that needs a
seperate patch.

Folded in a fix from Vignesh Radhakrishnan.

Reported-and-tested-by: Stanislav Fomichev <stfomichev@yandex-team.ru>
Based-on-patch-by: Stanislav Fomichev <stfomichev@yandex-team.ru>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: vigneshr@codeaurora.org
Cc: john.stultz@linaro.org
Cc: viresh.kumar@linaro.org
Cc: fweisbec@gmail.com
Cc: cl@linux.com
Cc: stuart.w.hayes@gmail.com
Link: http://lkml.kernel.org/r/alpine.DEB.2.11.1501202049190.5526@nanos
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2015-01-23 12:13:20 +01:00
Lai Jiangshan
4bee96860a smpboot: Add missing get_online_cpus() in smpboot_register_percpu_thread()
The following race exists in the smpboot percpu threads management:

CPU0	      	   	     CPU1
cpu_up(2)
  get_online_cpus();
  smpboot_create_threads(2);
			     smpboot_register_percpu_thread();
			     for_each_online_cpu();
			       __smpboot_create_thread();
  __cpu_up(2);

This results in a missing per cpu thread for the newly onlined cpu2 and
in a NULL pointer dereference on a consecutive offline of that cpu.

Proctect smpboot_register_percpu_thread() with get_online_cpus() to
prevent that.

[ tglx: Massaged changelog and removed the change in
        smpboot_unregister_percpu_thread() because that's an
        optimization and therefor not stable material. ]

Signed-off-by: Lai Jiangshan <laijs@cn.fujitsu.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Rusty Russell <rusty@rustcorp.com.au>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Srivatsa S. Bhat <srivatsa.bhat@linux.vnet.ibm.com>
Cc: David Rientjes <rientjes@google.com>
Cc: stable@vger.kernel.org
Link: http://lkml.kernel.org/r/1406777421-12830-1-git-send-email-laijs@cn.fujitsu.com
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2015-01-23 11:33:51 +01:00
Dave Hansen
e9d1b4f3c6 x86, mpx: Strictly enforce empty prctl() args
Description from Michael Kerrisk.  He suggested an identical patch
to one I had already coded up and tested.

commit fe3d197f84 "x86, mpx: On-demand kernel allocation of bounds
tables" added two new prctl() operations, PR_MPX_ENABLE_MANAGEMENT and
PR_MPX_DISABLE_MANAGEMENT.  However, no checks were included to ensure
that unused arguments are zero, as is done in many existing prctl()s
and as should be done for all new prctl()s. This patch adds the
required checks.

Suggested-by: Andy Lutomirski <luto@amacapital.net>
Suggested-by: Michael Kerrisk <mtk.manpages@gmail.com>
Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Dave Hansen <dave@sr71.net>
Link: http://lkml.kernel.org/r/20150108223022.7F56FD13@viggo.jf.intel.com
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2015-01-22 21:11:06 +01:00
Linus Torvalds
193934123c Surprising number of fixes this merge window :(
First two are minor fallout from the param rework which went in this merge
 window.
 
 Next three are a series which fixes a longstanding (but never previously
 reported and unlikely , so no CC stable) race between kallsyms and freeing
 the init section.
 
 Finally, a minor cleanup as our module refcount will now be -1 during
 unload.
 
 Thanks,
 Rusty.
 -----BEGIN PGP SIGNATURE-----
 Version: GnuPG v1
 
 iQIcBAABAgAGBQJUwEmwAAoJENkgDmzRrbjx77kP/1cNQR2eG2sBwokg3q0tvHnQ
 IKqEXErW7NvxRa+RAMEmy2uQoGt6+uNklAbtyJEYM9oR1NieFbPi2yrt9Xn5SAXS
 Brp1S8WYBMilA3W3o6I0trFDRWHdpdtkKIQwLWgJNSEWjbTXh8bSwp/2X1rlOPyI
 ZmphCMOQMU2/uFEyJhTz1WMEV8eVXiRLN8OxSkPxToxdZoGln2U8IBCCCJC9OG+f
 Cf3eMgEcNdEXNcPKqr11NIcHkAx6M6qI/eMDOqk151PslHa8lbis6di9Z87aE0ps
 i8PyrkJGTmgM9cCjXwE8deNseeCmuKYlbPIF+NoxcqtvZstfaMrISwTIEuzV4JHi
 p13YhDxy4XiC3H6pKHub/jo7UCl+wWtFh9SqpqGgduFX/p6FtUHQJm0S0X/DFFZt
 C+2MFVSe6HRHE8B7bFz86+619Qd/rU7+806CLCE+NbYlYAKIBYKzWt/bml6VH3RJ
 OjwXhQqmznWhJjsfD3BUUUpZpHijmylI9gAe2F1oErb8YjRU6gIm7P8hlkOzD7AS
 TfGHPFq2raQcfAiGdVmvkbvvhvYZXnB3WVsAexrYoqrT9I8eEfRI+7SkL75MLR2E
 ikzhJS3SHkAUAd7fUVMt7xMwh0jmhsPjWCCqc13m6UUFoXhTaDgKgPGftltN0bI2
 g85+enZ3/eca6xh/KxvW
 =Kf9b
 -----END PGP SIGNATURE-----

Merge tag 'fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/rusty/linux

Pull module and param fixes from Rusty Russell:
 "Surprising number of fixes this merge window :(

  The first two are minor fallout from the param rework which went in
  this merge window.

  The next three are a series which fixes a longstanding (but never
  previously reported and unlikely , so no CC stable) race between
  kallsyms and freeing the init section.

  Finally, a minor cleanup as our module refcount will now be -1 during
  unload"

* tag 'fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/rusty/linux:
  module: make module_refcount() a signed integer.
  module: fix race in kallsyms resolution during module load success.
  module: remove mod arg from module_free, rename module_memfree().
  module_arch_freeing_init(): new hook for archs before module->module_init freed.
  param: fix uninitialized read with CONFIG_DEBUG_LOCK_ALLOC
  param: initialize store function to NULL if not available.
2015-01-23 06:40:36 +12:00
Johannes Weiner
3c606d35fe cgroup: prevent mount hang due to memory controller lifetime
Since b2052564e6 ("mm: memcontrol: continue cache reclaim from
offlined groups"), re-mounting the memory controller after using it is
very likely to hang.

The cgroup core assumes that any remaining references after deleting a
cgroup are temporary in nature, and synchroneously waits for them, but
the above-mentioned commit has left-over page cache pin its css until
it is reclaimed naturally.  That being said, swap entries and charged
kernel memory have been doing the same indefinite pinning forever, the
bug is just more likely to trigger with left-over page cache.

Reparenting kernel memory is highly impractical, which leaves changing
the cgroup assumptions to reflect this: once a controller has been
mounted and used, it has internal state that is independent from mount
and cgroup lifetime.  It can be unmounted and remounted, but it can't
be reconfigured during subsequent mounts.

Don't offline the controller root as long as there are any children,
dead or alive.  A remount will no longer wait for these old references
to drain, it will simply mount the persistent controller state again.

Reported-by: "Suzuki K. Poulose" <Suzuki.Poulose@arm.com>
Reported-by: Will Deacon <will.deacon@arm.com>
Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
Signed-off-by: Tejun Heo <tj@kernel.org>
2015-01-22 10:26:43 -05:00
Thomas Gleixner
5fbaba8603 Merge branch 'fortglx/3.19-stable/time' of https://git.linaro.org/people/john.stultz/linux into timers/urgent
Pull urgent fixes from John Stultz:

  Two urgent fixes for user triggerable time related overflow issues
2015-01-22 12:28:02 +01:00
Rusty Russell
d5db139ab3 module: make module_refcount() a signed integer.
James Bottomley points out that it will be -1 during unload.  It's
only used for diagnostics, so let's not hide that as it could be a
clue as to what's gone wrong.

Cc: Jason Wessel <jason.wessel@windriver.com>
Acked-and-documention-added-by: James Bottomley <James.Bottomley@HansenPartnership.com>
Reviewed-by: Masami Hiramatsu <maasami.hiramatsu.pt@hitachi.com>
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
2015-01-22 11:15:54 +10:30
Josh Poimboeuf
dbed7ddab9 livepatch: fix uninitialized return value
Fix a potentially uninitialized return value in klp_enable_func().

Signed-off-by: Josh Poimboeuf <jpoimboe@redhat.com>
Reviewed-by: Miroslav Benes <mbenes@suse.cz>
Signed-off-by: Jiri Kosina <jkosina@suse.cz>
2015-01-21 15:22:48 +01:00
Ingo Molnar
f49028292c Merge branch 'for-mingo' of git://git.kernel.org/pub/scm/linux/kernel/git/paulmck/linux-rcu into core/rcu
Pull RCU updates from Paul E. McKenney:

  - Documentation updates.

  - Miscellaneous fixes.

  - Preemptible-RCU fixes, including fixing an old bug in the
    interaction of RCU priority boosting and CPU hotplug.

  - SRCU updates.

  - RCU CPU stall-warning updates.

  - RCU torture-test updates.

Signed-off-by: Ingo Molnar <mingo@kernel.org>
2015-01-21 06:12:21 +01:00
Linus Torvalds
d4b2d0061d Merge branch 'for-3.19-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/wq
Pull workqueue fix from Tejun Heo:
 "The xfs folks have been running into weird and very rare lockups for
  some time now.  I didn't think this could have been from workqueue
  side because no one else was reporting it.  This time, Eric had a
  kdump which we looked into and it turned out this actually was a
  workqueue bug and the bug has been there since the beginning of
  concurrency managed workqueue.

  A worker pool ensures forward progress of the workqueues associated
  with it by always having at least one worker reserved from executing
  work items.  When the pool is under contention, the idle one tries to
  create more workers for the pool and if that doesn't succeed quickly
  enough, it calls the rescuers to the pool.

  This logic had a subtle race condition in an early exit path.  When a
  worker invokes this manager function, the function may return %false
  indicating that the caller may proceed to executing work items either
  because another worker is already performing the role or conditions
  have changed and the pool is no longer under contention.

  The latter part depended on the assumption that whether more workers
  are necessary or not remains stable while the pool is locked; however,
  pool->nr_running (concurrency count) may change asynchronously and it
  getting bumped from zero asynchronously could send off the last idle
  worker to execute work items.

  The race window is fairly narrow, and, even when it gets triggered,
  the pool deadlocks iff if all work items get blocked on pending work
  items of the pool, which is highly unlikely but can be triggered by
  xfs.

  The patch removes the race window by removing the early exit path,
  which doesn't server any purpose anymore anyway"

* 'for-3.19-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/wq:
  workqueue: fix subtle pool management issue which can stall whole worker_pool
2015-01-21 07:51:46 +12:00
Josh Poimboeuf
3c33f5b99d livepatch: support for repatching a function
Add support for patching a function multiple times.  If multiple patches
affect a function, the function in the most recently enabled patch
"wins".  This enables a cumulative patch upgrade path, where each patch
is a superset of previous patches.

This requires restructuring the data a little bit.  With the current
design, where each klp_func struct has its own ftrace_ops, we'd have to
unregister the old ops and then register the new ops, because
FTRACE_OPS_FL_IPMODIFY prevents us from having two ops registered for
the same function at the same time.  That would leave a regression
window where the function isn't patched at all (not good for a patch
upgrade path).

This patch replaces the per-klp_func ftrace_ops with a global klp_ops
list, with one ftrace_ops per original function.  A single ftrace_ops is
shared between all klp_funcs which have the same old_addr.  This allows
the switch between function versions to happen instantaneously by
updating the klp_ops struct's func_stack list.  The winner is the
klp_func at the top of the func_stack (front of the list).

[ jkosina@suse.cz: turn WARN_ON() into WARN_ON_ONCE() in ftrace handler to
  avoid storm in pathological cases ]

Signed-off-by: Josh Poimboeuf <jpoimboe@redhat.com>
Reviewed-by: Jiri Slaby <jslaby@suse.cz>
Signed-off-by: Jiri Kosina <jkosina@suse.cz>
2015-01-20 20:09:41 +01:00
Josh Poimboeuf
83a90bb134 livepatch: enforce patch stacking semantics
Only allow the topmost patch on the stack to be enabled or disabled, so
that patches can't be removed or added in an arbitrary order.

Suggested-by: Jiri Kosina <jkosina@suse.cz>
Signed-off-by: Josh Poimboeuf <jpoimboe@redhat.com>
Reviewed-by: Jiri Slaby <jslaby@suse.cz>
Signed-off-by: Jiri Kosina <jkosina@suse.cz>
2015-01-20 20:09:41 +01:00
Miroslav Benes
32b7eb8771 livepatch: change ARCH_HAVE_LIVE_PATCHING to HAVE_LIVE_PATCHING
Change ARCH_HAVE_LIVE_PATCHING to HAVE_LIVE_PATCHING in Kconfigs. HAVE_
bools are prevalent there and we should go with the flow.

Suggested-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Miroslav Benes <mbenes@suse.cz>
Acked-by: Josh Poimboeuf <jpoimboe@redhat.com>
Signed-off-by: Jiri Kosina <jkosina@suse.cz>
2015-01-20 15:02:25 +01:00
Rusty Russell
c749637909 module: fix race in kallsyms resolution during module load success.
The kallsyms routines (module_symbol_name, lookup_module_* etc) disable
preemption to walk the modules rather than taking the module_mutex:
this is because they are used for symbol resolution during oopses.

This works because there are synchronize_sched() and synchronize_rcu()
in the unload and failure paths.  However, there's one case which doesn't
have that: the normal case where module loading succeeds, and we free
the init section.

We don't want a synchronize_rcu() there, because it would slow down
module loading: this bug was introduced in 2009 to speed module
loading in the first place.

Thus, we want to do the free in an RCU callback.  We do this in the
simplest possible way by allocating a new rcu_head: if we put it in
the module structure we'd have to worry about that getting freed.

Reported-by: Rui Xiang <rui.xiang@huawei.com>
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
2015-01-20 11:38:34 +10:30
Rusty Russell
be1f221c04 module: remove mod arg from module_free, rename module_memfree().
Nothing needs the module pointer any more, and the next patch will
call it from RCU, where the module itself might no longer exist.
Removing the arg is the safest approach.

This just codifies the use of the module_alloc/module_free pattern
which ftrace and bpf use.

Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
Acked-by: Alexei Starovoitov <ast@kernel.org>
Cc: Mikael Starvik <starvik@axis.com>
Cc: Jesper Nilsson <jesper.nilsson@axis.com>
Cc: Ralf Baechle <ralf@linux-mips.org>
Cc: Ley Foon Tan <lftan@altera.com>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Chris Metcalf <cmetcalf@ezchip.com>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: x86@kernel.org
Cc: Ananth N Mavinakayanahalli <ananth@in.ibm.com>
Cc: Anil S Keshavamurthy <anil.s.keshavamurthy@intel.com>
Cc: Masami Hiramatsu <masami.hiramatsu.pt@hitachi.com>
Cc: linux-cris-kernel@axis.com
Cc: linux-kernel@vger.kernel.org
Cc: linux-mips@linux-mips.org
Cc: nios2-dev@lists.rocketboards.org
Cc: linuxppc-dev@lists.ozlabs.org
Cc: sparclinux@vger.kernel.org
Cc: netdev@vger.kernel.org
2015-01-20 11:38:33 +10:30
Rusty Russell
d453cded05 module_arch_freeing_init(): new hook for archs before module->module_init freed.
Archs have been abusing module_free() to clean up their arch-specific
allocations.  Since module_free() is also (ab)used by BPF and trace code,
let's keep it to simple allocations, and provide a hook called before
that.

This means that avr32, ia64, parisc and s390 no longer need to implement
their own module_free() at all.  avr32 doesn't need module_finalize()
either.

Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
Cc: Chris Metcalf <cmetcalf@ezchip.com>
Cc: Haavard Skinnemoen <hskinnemoen@gmail.com>
Cc: Hans-Christian Egtvedt <egtvedt@samfundet.no>
Cc: Tony Luck <tony.luck@intel.com>
Cc: Fenghua Yu <fenghua.yu@intel.com>
Cc: "James E.J. Bottomley" <jejb@parisc-linux.org>
Cc: Helge Deller <deller@gmx.de>
Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
Cc: linux-kernel@vger.kernel.org
Cc: linux-ia64@vger.kernel.org
Cc: linux-parisc@vger.kernel.org
Cc: linux-s390@vger.kernel.org
2015-01-20 11:38:32 +10:30
Rusty Russell
c772be5231 param: fix uninitialized read with CONFIG_DEBUG_LOCK_ALLOC
ignore_lockdep is uninitialized, and sysfs_attr_init() doesn't initialize
it, so memset to 0.

Reported-by: Huang Ying <ying.huang@intel.com>
Cc: Eric W. Biederman <ebiederm@xmission.com>
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
2015-01-20 11:38:31 +10:30
Michael Kerrisk
996636ddae futex: Fix argument handling in futex_lock_pi() calls
This patch fixes two separate buglets in calls to futex_lock_pi():

  * Eliminate unused 'detect' argument
  * Change unused 'timeout' argument of FUTEX_TRYLOCK_PI to NULL

The 'detect' argument of futex_lock_pi() seems never to have been
used (when it was included with the initial PI mutex implementation
in Linux 2.6.18, all checks against its value were disabled by
ANDing against 0 (i.e., if (detect... && 0)), and with
commit 778e9a9c3e, any mention of
this argument in futex_lock_pi() went way altogether. Its presence
now serves only to confuse readers of the code, by giving the
impression that the futex() FUTEX_LOCK_PI operation actually does
use the 'val' argument. This patch removes the argument.

The futex_lock_pi() call that corresponds to FUTEX_TRYLOCK_PI includes
'timeout' as one of its arguments. This misleads the reader into thinking
that the FUTEX_TRYLOCK_PI operation does employ timeouts for some sensible
purpose; but it does not.  Indeed, it cannot, because the checks at the
start of sys_futex() exclude FUTEX_TRYLOCK_PI from the set of operations
that do copy_from_user() on the timeout argument. So, in the
FUTEX_TRYLOCK_PI futex_lock_pi() call it would be simplest to change
'timeout' to 'NULL'. This patch does that.

Signed-off-by: Michael Kerrisk <mtk.manpages@gmail.com>
Reviewed-by: Darren Hart <darren@dvhart.com>
Link: http://lkml.kernel.org/r/54B96646.8010200@gmail.com
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2015-01-19 12:05:32 +01:00
Johannes Berg
053c095a82 netlink: make nlmsg_end() and genlmsg_end() void
Contrary to common expectations for an "int" return, these functions
return only a positive value -- if used correctly they cannot even
return 0 because the message header will necessarily be in the skb.

This makes the very common pattern of

  if (genlmsg_end(...) < 0) { ... }

be a whole bunch of dead code. Many places also simply do

  return nlmsg_end(...);

and the caller is expected to deal with it.

This also commonly (at least for me) causes errors, because it is very
common to write

  if (my_function(...))
    /* error condition */

and if my_function() does "return nlmsg_end()" this is of course wrong.

Additionally, there's not a single place in the kernel that actually
needs the message length returned, and if anyone needs it later then
it'll be very easy to just use skb->len there.

Remove this, and make the functions void. This removes a bunch of dead
code as described above. The patch adds lines because I did

-	return nlmsg_end(...);
+	nlmsg_end(...);
+	return 0;

I could have preserved all the function's return values by returning
skb->len, but instead I've audited all the places calling the affected
functions and found that none cared. A few places actually compared
the return value with <= 0 in dump functionality, but that could just
be changed to < 0 with no change in behaviour, so I opted for the more
efficient version.

One instance of the error I've made numerous times now is also present
in net/phonet/pn_netlink.c in the route_dumpit() function - it didn't
check for <0 or <=0 and thus broke out of the loop every single time.
I've preserved this since it will (I think) have caused the messages to
userspace to be formatted differently with just a single message for
every SKB returned to userspace. It's possible that this isn't needed
for the tools that actually use this, but I don't even know what they
are so couldn't test that changing this behaviour would be acceptable.

Signed-off-by: Johannes Berg <johannes.berg@intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-01-18 01:03:45 -05:00
Louis Langholtz
fc7f0dd381 kernel: avoid overflow in cmp_range
Avoid overflow possibility.

[ The overflow is purely theoretical, since this is used for memory
  ranges that aren't even close to using the full 64 bits, but this is
  the right thing to do regardless.  - Linus ]

Signed-off-by: Louis Langholtz <lou_langholtz@me.com>
Cc: Yinghai Lu <yinghai@kernel.org>
Cc: Peter Anvin <hpa@linux.intel.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2015-01-17 10:02:23 +13:00
Tejun Heo
29187a9eea workqueue: fix subtle pool management issue which can stall whole worker_pool
A worker_pool's forward progress is guaranteed by the fact that the
last idle worker assumes the manager role to create more workers and
summon the rescuers if creating workers doesn't succeed in timely
manner before proceeding to execute work items.

This manager role is implemented in manage_workers(), which indicates
whether the worker may proceed to work item execution with its return
value.  This is necessary because multiple workers may contend for the
manager role, and, if there already is a manager, others should
proceed to work item execution.

Unfortunately, the function also indicates that the worker may proceed
to work item execution if need_to_create_worker() is false at the head
of the function.  need_to_create_worker() tests the following
conditions.

	pending work items && !nr_running && !nr_idle

The first and third conditions are protected by pool->lock and thus
won't change while holding pool->lock; however, nr_running can change
asynchronously as other workers block and resume and while it's likely
to be zero, as someone woke this worker up in the first place, some
other workers could have become runnable inbetween making it non-zero.

If this happens, manage_worker() could return false even with zero
nr_idle making the worker, the last idle one, proceed to execute work
items.  If then all workers of the pool end up blocking on a resource
which can only be released by a work item which is pending on that
pool, the whole pool can deadlock as there's no one to create more
workers or summon the rescuers.

This patch fixes the problem by removing the early exit condition from
maybe_create_worker() and making manage_workers() return false iff
there's already another manager, which ensures that the last worker
doesn't start executing work items.

We can leave the early exit condition alone and just ignore the return
value but the only reason it was put there is because the
manage_workers() used to perform both creations and destructions of
workers and thus the function may be invoked while the pool is trying
to reduce the number of workers.  Now that manage_workers() is called
only when more workers are needed, the only case this early exit
condition is triggered is rare race conditions rendering it pointless.

Tested with simulated workload and modified workqueue code which
trigger the pool deadlock reliably without this patch.

Signed-off-by: Tejun Heo <tj@kernel.org>
Reported-by: Eric Sandeen <sandeen@sandeen.net>
Link: http://lkml.kernel.org/g/54B019F4.8030009@sandeen.net
Cc: Dave Chinner <david@fromorbit.com>
Cc: Lai Jiangshan <laijs@cn.fujitsu.com>
Cc: stable@vger.kernel.org
2015-01-16 14:21:16 -05:00
Linus Torvalds
23aa4b416a This holds a few fixes to the ftrace infrastructure as well as
the mixture of function graph tracing and kprobes.
 
 When jprobes and function graph tracing is enabled at the same time
 it will crash the system.
 
   # modprobe jprobe_example
   # echo function_graph > /sys/kernel/debug/tracing/current_tracer
 
 After the first fork (jprobe_example probes it), the system will crash.
 This is due to the way jprobes copies the stack frame and does not
 do a normal function return. This messes up with the function graph
 tracing accounting which hijacks the return address from the stack
 and replaces it with a hook function. It saves the return addresses in
 a separate stack to put back the correct return address when done.
 But because the jprobe functions do not do a normal return, their
 stack addresses are not put back until the function they probe is called,
 which means that the probed function will get the return address of
 the jprobe handler instead of its own.
 
 The simple fix here was to disable function graph tracing while the
 jprobe handler is being called.
 
 While debugging this I found two minor bugs with the function graph
 tracing.
 
 The first was about the function graph tracer sharing its function hash
 with the function tracer (they both get filtered by the same input).
 The changing of the set_ftrace_filter would not sync the function recording
 records after a change if the function tracer was disabled but the
 function graph tracer was enabled. This was due to the update only checking
 one of the ops instead of the shared ops to see if they were enabled and
 should perform the sync. This caused the ftrace accounting to break and
 a ftrace_bug() would be triggered, disabling ftrace until a reboot.
 
 The second was that the check to update records only checked one of the
 filter hashes. It needs to test both the "filter" and "notrace" hashes.
 The "filter" hash determines what functions to trace where as the "notrace"
 hash determines what functions not to trace (trace all but these).
 Both hashes need to be passed to the update code to find out what change
 is being done during the update. This also broke the ftrace record
 accounting and triggered a ftrace_bug().
 
 This patch set also include two more fixes that were reported separately
 from the kprobe issue.
 
 One was that init_ftrace_syscalls() was called twice at boot up.
 This is not a major bug, but that call performed a rather large kmalloc
 (NR_syscalls * sizeof(*syscalls_metadata)). The second call made the first
 one a memory leak, and wastes memory.
 
 The other fix is a regression caused by an update in the v3.19 merge window.
 The moving to enable events early, moved the enabling before PID 1 was
 created. The syscall events require setting the TIF_SYSCALL_TRACEPOINT
 for all tasks. But for_each_process_thread() does not include the swapper
 task (PID 0), and ended up being a nop. A suggested fix was to add
 the init_task() to have its flag set, but I didn't really want to mess
 with PID 0 for this minor bug. Instead I disable and re-enable events again
 at early_initcall() where it use to be enabled. This also handles any other
 event that might have its own reg function that could break at early
 boot up.
 -----BEGIN PGP SIGNATURE-----
 Version: GnuPG v1
 
 iQEcBAABAgAGBQJUt9vmAAoJEEjnJuOKh9ldLHEIAJ9XrPW2xMIY5yI69jT1F7pv
 PkSRqENnOK0l4UulD52SvIBecQTTBcEEjao4yVGkc7DCJBOws/1LZ5gW8OfNlKjq
 rMB8yaosL1tXJ1ARVPMjcQVy+228zkgTXznwEZCjku1g7LuScQ28qyXsXO7B6yiK
 xKoHqKjygmM/a2aVn+8tdiVKiDp6jdmkbYicbaFT4xP7XB5DaMmIiXRHxdvW6xdR
 azKrVfYiMyJqTZNt/EVSWUk2WjeaYhoXyNtvgPx515wTo/llCnzhjcsocXBtH2P/
 YOtwl+1L7Z89ukV9oXqrtrUJZ6Ps7+g7I1flJuL7/1FlNGnklcP9JojD+t6HeT8=
 =vkec
 -----END PGP SIGNATURE-----

Merge tag 'trace-fixes-v3.19-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace

Pull ftrace fixes from Steven Rostedt:
 "This holds a few fixes to the ftrace infrastructure as well as the
  mixture of function graph tracing and kprobes.

  When jprobes and function graph tracing is enabled at the same time it
  will crash the system:

      # modprobe jprobe_example
      # echo function_graph > /sys/kernel/debug/tracing/current_tracer

  After the first fork (jprobe_example probes it), the system will
  crash.

  This is due to the way jprobes copies the stack frame and does not do
  a normal function return.  This messes up with the function graph
  tracing accounting which hijacks the return address from the stack and
  replaces it with a hook function.  It saves the return addresses in a
  separate stack to put back the correct return address when done.  But
  because the jprobe functions do not do a normal return, their stack
  addresses are not put back until the function they probe is called,
  which means that the probed function will get the return address of
  the jprobe handler instead of its own.

  The simple fix here was to disable function graph tracing while the
  jprobe handler is being called.

  While debugging this I found two minor bugs with the function graph
  tracing.

  The first was about the function graph tracer sharing its function
  hash with the function tracer (they both get filtered by the same
  input).  The changing of the set_ftrace_filter would not sync the
  function recording records after a change if the function tracer was
  disabled but the function graph tracer was enabled.  This was due to
  the update only checking one of the ops instead of the shared ops to
  see if they were enabled and should perform the sync.  This caused the
  ftrace accounting to break and a ftrace_bug() would be triggered,
  disabling ftrace until a reboot.

  The second was that the check to update records only checked one of
  the filter hashes.  It needs to test both the "filter" and "notrace"
  hashes.  The "filter" hash determines what functions to trace where as
  the "notrace" hash determines what functions not to trace (trace all
  but these).  Both hashes need to be passed to the update code to find
  out what change is being done during the update.  This also broke the
  ftrace record accounting and triggered a ftrace_bug().

  This patch set also include two more fixes that were reported
  separately from the kprobe issue.

  One was that init_ftrace_syscalls() was called twice at boot up.  This
  is not a major bug, but that call performed a rather large kmalloc
  (NR_syscalls * sizeof(*syscalls_metadata)).  The second call made the
  first one a memory leak, and wastes memory.

  The other fix is a regression caused by an update in the v3.19 merge
  window.  The moving to enable events early, moved the enabling before
  PID 1 was created.  The syscall events require setting the
  TIF_SYSCALL_TRACEPOINT for all tasks.  But for_each_process_thread()
  does not include the swapper task (PID 0), and ended up being a nop.

  A suggested fix was to add the init_task() to have its flag set, but I
  didn't really want to mess with PID 0 for this minor bug.  Instead I
  disable and re-enable events again at early_initcall() where it use to
  be enabled.  This also handles any other event that might have its own
  reg function that could break at early boot up"

* tag 'trace-fixes-v3.19-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace:
  tracing: Fix enabling of syscall events on the command line
  tracing: Remove extra call to init_ftrace_syscalls()
  ftrace/jprobes/x86: Fix conflict between jprobes and function graph tracing
  ftrace: Check both notrace and filter for old hash
  ftrace: Fix updating of filters for shared global_ops filters
2015-01-17 07:55:52 +13:00