Commit graph

22575 commits

Pavankumar Kondeti
f439dd8a41 sched: fix argument type in update_task_burst()
The runtime argument of update_task_burst() should be u64, not int.
Fix this to avoid a potential overflow.
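
As a rough illustration of why int is too narrow (assuming the runtime is
tracked in nanoseconds, as elsewhere in the scheduler):

    /* Illustrative arithmetic only, not part of the patch:         */
    /*   INT_MAX ns = 2147483647 ns ~= 2.1 s of runtime before wrap */
    /*   U64_MAX ns               ~= 584 years of runtime           */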

Change-Id: I33757b7b42f142138c1a099bb8be18c2a3bed331
Signed-off-by: Pavankumar Kondeti <pkondeti@codeaurora.org>
2017-02-02 11:07:49 +05:30
Eric Dumazet
d1b232c2ce sysctl: fix proc_doulongvec_ms_jiffies_minmax()
commit ff9f8a7cf935468a94d9927c68b00daae701667e upstream.

We perform the conversion between kernel jiffies and ms only when
exporting the kernel value to user space.

We need to do the opposite operation when the value is written by the user.

This only matters when HZ != 1000.
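
For illustration, the symmetry the fix restores mirrors that of the standard
jiffies helpers (a sketch only; the proc handler itself does the scaling with
HZ conversion factors internally, and the variable names here are illustrative):

    /* Sketch: with HZ != 1000 these two directions are not identity operations. */
    unsigned int  ms_for_user   = jiffies_to_msecs(kernel_jiffies);  /* read/export path  */
    unsigned long jiffies_value = msecs_to_jiffies(ms_from_user);    /* write/import path */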

Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2017-02-01 08:30:52 +01:00
Pavankumar Kondeti
b559daa261 sched: maintain group busy time counters in runqueue
There is no advantage to tracking busy time counters per related
thread group. We need busy time across all groups for either a CPU
or a frequency domain, so maintain the group busy time counters in
the runqueue itself. When the CPU window is rolled over, the group
busy counters are rolled over as well. This eliminates the overhead
of maintaining each group's window_start.

As related thread groups are now preallocated, this patch saves
40 * nr_cpu_ids * (nr_grp - 1) bytes of memory.
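
As a worked example with illustrative numbers: for nr_cpu_ids = 8 and
nr_grp = 5, that is 40 * 8 * (5 - 1) = 1280 bytes saved.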

Change-Id: Ieaaccea483b377f54ea1761e6939ee23a78a5e9c
Signed-off-by: Pavankumar Kondeti <pkondeti@codeaurora.org>
2017-02-01 09:16:39 +05:30
Linux Build Service Account
3ed889b39e Merge "sched: set LBF_IGNORE_PREFERRED_CLUSTER_TASKS correctly" 2017-01-30 07:05:23 -08:00
Linux Build Service Account
f6e3e8bba4 Merge "sysctl: enable strict writes" 2017-01-30 07:04:51 -08:00
Pavankumar Kondeti
827a31c699 sched: set LBF_IGNORE_PREFERRED_CLUSTER_TASKS correctly
The LBF_IGNORE_PREFERRED_CLUSTER_TASKS flag needs to be set for
all types of inter-cluster load balancing. Currently it is set only
when a higher capacity CPU is pulling tasks from a lower capacity
CPU. This can result in the migration of grouped tasks from the
higher capacity cluster to the lower capacity cluster.

Change-Id: Ib0476c5c85781804798ef49268e1b193859ff5ef
Signed-off-by: Pavankumar Kondeti <pkondeti@codeaurora.org>
2017-01-27 21:18:12 -08:00
Oleg Nesterov
5cbee2fa5d Use after free from pid_nr_ns()
A use-after-free is reported from pid_nr_ns() because the group
leader task has already been freed while other tasks still hold
its address in their task->group_leader pointer.

pid_nr_ns+0x10/0x38
cgroup_pidlist_start+0x144/0x400
cgroup_seqfile_start+0x1c/0x24
kernfs_seq_start+0x54/0x90
seq_read+0x15c/0x3a8
kernfs_fop_read+0x38/0x160
__vfs_read+0x28/0xc8
vfs_read+0x84/0xfc
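
A common defensive pattern for this class of problem (a sketch of the
general idea, not necessarily the exact fix) is to hold the RCU read lock
and check that the task's pid links are still alive before walking to the
group leader:

    rcu_read_lock();
    if (pid_alive(task))
        nr = task_tgid_nr_ns(task, ns);   /* group leader's pid link is still valid here */
    rcu_read_unlock();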

Change-Id: Ib6b3fc75bf0d24a04455bf81d54900c21c434958
Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Signed-off-by: Prasad Sodagudi <psodagud@codeaurora.org>
2017-01-23 18:31:21 -08:00
Runmin Wang
778031ccb5 genirq: Add IRQ_AFFINITY_MANAGED flag
Add the IRQ_AFFINITY_MANAGED flag and related kernel APIs so that
a kernel driver can mark an irq such that user space affinity
changes are ignored. Kernel space affinity settings remain
unchanged.
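
A plausible driver-side usage, sketched under the assumption that the new
flag is applied through the existing irq_set_status_flags() helper (the
exact APIs added by this patch are not shown in this log):

    /* Assumption: IRQ_AFFINITY_MANAGED behaves like other IRQ_* status flags. */
    irq_set_status_flags(irq, IRQ_AFFINITY_MANAGED);  /* user space smp_affinity writes are ignored */
    irq_set_affinity(irq, cpumask_of(2));             /* kernel space setting still takes effect */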

Change-Id: Ib2d5ea651263bff4317562af69079ad950c9e71e
Signed-off-by: Runmin Wang <runminw@codeaurora.org>
2017-01-23 16:01:01 -08:00
Thomas Gleixner
1cc869442a genirq: Introduce IRQD_AFFINITY_MANAGED flag
Interrupts marked with this flag are excluded from user space interrupt
affinity changes. Contrary to the IRQ_NO_BALANCING flag, the kernel internal
affinity mechanism is not blocked.

This flag will be used for multi-queue device interrupts.
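
The accompanying accessor follows the usual irqd_*() pattern; a sketch
consistent with the other IRQD_* helpers in <linux/irq.h>:

    static inline bool irqd_affinity_is_managed(struct irq_data *d)
    {
        return __irqd_to_state(d) & IRQD_AFFINITY_MANAGED;
    }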

Change-Id: I204c49bb1c8ce87fbcd163119093163b120bfe83
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: Christoph Hellwig <hch@lst.de>
Cc: linux-block@vger.kernel.org
Cc: linux-pci@vger.kernel.org
Cc: linux-nvme@lists.infradead.org
Cc: axboe@fb.com
Cc: agordeev@redhat.com
Link: http://lkml.kernel.org/r/1467621574-8277-3-git-send-email-hch@lst.de
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Git-commit: 9c2555835bb3d34dfac52a0be943dcc4bedd650f
Git-repo: git://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git
[runminw@codeaurora.org: resolve trivial merge conflicts]
Signed-off-by: Runmin Wang <runminw@codeaurora.org>
2017-01-23 11:23:37 -08:00
Alex Shi
b4bbeeb816 Merge branch 'linux-linaro-lsk-v4.4' into linux-linaro-lsk-v4.4-android 2017-01-22 12:01:43 +08:00
Alex Shi
261e8dbdb9 Merge tag 'v4.4.44' into linux-linaro-lsk-v4.4
This is the 4.4.44 stable release
2017-01-22 12:01:41 +08:00
Syed Rameez Mustafa
196069b1bc sched: Update capacity and load scale factor for all clusters at boot
Cluster capacities should reflect differences in efficiency between
clusters even in the absence of cpufreq. Currently capacity is
updated only when a cpufreq policy notifier is received, so placement
is suboptimal when cpufreq is turned off. Fix this by updating
capacities and load scaling factors during cluster detection.

Change-Id: I47f63c1e374bbfd247a4302525afb37d55334bad
Signed-off-by: Syed Rameez Mustafa <rameezmustafa@codeaurora.org>
2017-01-20 16:58:11 -08:00
David Matlack
3d27cd4b25 jump_labels: API for flushing deferred jump label updates
commit b6416e61012429e0277bd15a229222fd17afc1c1 upstream.

Modules that use static_key_deferred need a way to synchronize with
any delayed work that is still pending when the module is unloaded.
Introduce static_key_deferred_flush() which flushes any pending
jump label updates.
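
A typical call site, sketched for a module exit path (the key name and
module are illustrative, not taken from this patch):

    static struct static_key_deferred my_feature_key;      /* illustrative key */

    static void __exit my_module_exit(void)
    {
        static_key_slow_dec_deferred(&my_feature_key);
        /* Ensure no delayed jump label update is still pending once the module is gone. */
        static_key_deferred_flush(&my_feature_key);
    }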

Signed-off-by: David Matlack <dmatlack@google.com>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2017-01-19 20:17:19 +01:00
Dan Williams
70429b970b mm: fix devm_memremap_pages crash, use mem_hotplug_{begin, done}
commit f931ab479dd24cf7a2c6e2df19778406892591fb upstream.

Both arch_add_memory() and arch_remove_memory() expect a single threaded
context.

For example, arch/x86/mm/init_64.c::kernel_physical_mapping_init() does
not hold any locks over this check and branch:

    if (pgd_val(*pgd)) {
    	pud = (pud_t *)pgd_page_vaddr(*pgd);
    	paddr_last = phys_pud_init(pud, __pa(vaddr),
    				   __pa(vaddr_end),
    				   page_size_mask);
    	continue;
    }

    pud = alloc_low_page();
    paddr_last = phys_pud_init(pud, __pa(vaddr), __pa(vaddr_end),
    			   page_size_mask);

The result is that two threads calling devm_memremap_pages()
simultaneously can end up colliding on pgd initialization.  This leads
to crash signatures like the following where the loser of the race
initializes the wrong pgd entry:

    BUG: unable to handle kernel paging request at ffff888ebfff0000
    IP: memcpy_erms+0x6/0x10
    PGD 2f8e8fc067 PUD 0 /* <---- Invalid PUD */
    Oops: 0000 [#1] SMP DEBUG_PAGEALLOC
    CPU: 54 PID: 3818 Comm: systemd-udevd Not tainted 4.6.7+ #13
    task: ffff882fac290040 ti: ffff882f887a4000 task.ti: ffff882f887a4000
    RIP: memcpy_erms+0x6/0x10
    [..]
    Call Trace:
      ? pmem_do_bvec+0x205/0x370 [nd_pmem]
      ? blk_queue_enter+0x3a/0x280
      pmem_rw_page+0x38/0x80 [nd_pmem]
      bdev_read_page+0x84/0xb0

Hold the standard memory hotplug mutex over calls to
arch_{add,remove}_memory().
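
A minimal sketch of the locking pattern described above (the
arch_add_memory() signature varies slightly between kernel versions;
error handling elided):

    mem_hotplug_begin();
    error = arch_add_memory(nid, align_start, align_size, true);
    mem_hotplug_done();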

Fixes: 41e94a8513 ("add devm_memremap_pages")
Link: http://lkml.kernel.org/r/148357647831.9498.12606007370121652979.stgit@dwillia2-desk3.amr.corp.intel.com
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
Cc: Christoph Hellwig <hch@lst.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2017-01-19 20:17:18 +01:00
Linux Build Service Account
4f0a0766d1 Merge "sched: kill sync_cpu maintenance" 2017-01-19 09:52:20 -08:00
Linux Build Service Account
fbbaeb656a Merge "sched: hmp: Remove the global sysctl_sched_enable_colocation tunable" 2017-01-18 23:48:38 -08:00
Linux Build Service Account
b7193a89d1 Merge "tracing: Use SOFTIRQ_OFFSET for softirq detection for more accurate results" 2017-01-18 23:48:36 -08:00
Pavankumar Kondeti
6d63f38bf2 sched: kill sync_cpu maintenance
We assume the boot CPU is the sync CPU and initialize its window_start
to sched_ktime_clock(). As windows are synchronized across all
CPUs, the secondary CPUs' window_start values are initialized from
the sync CPU's window_start. A CPU's window_start is never reset, so
this synchronization happens only once for a given CPU. Given this
fact, there is no need to reassign the sync_cpu role to another
CPU when the boot CPU goes offline. Remove this unnecessary
maintenance of sync_cpu and use any online CPU's window_start as
the reference.
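
A sketch of taking any online CPU's window_start as the reference (helper
names are from the scheduler core; treat this as illustrative rather than
the literal patch):

    int ref_cpu = cpumask_any(cpu_online_mask);
    u64 window_start = cpu_rq(ref_cpu)->window_start;  /* windows are in sync, so any online CPU will do */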

Change-Id: I169a8e80573c6dbcb1edeab0659c07c17102f4c9
Signed-off-by: Pavankumar Kondeti <pkondeti@codeaurora.org>
2017-01-19 12:30:19 +05:30
Vikram Mulukutla
e7dd50fa46 sched: hmp: Remove the global sysctl_sched_enable_colocation tunable
Colocation in HMP includes a tunable that turns on or off the feature
globally across all colocation groups. Supporting this tunable correctly
would result in complexity that would outweigh any foreseeable benefits.
For example, disabling the feature globally would involve deleting all
colocation groups one by one while ensuring no placement decisions are
made during the process.

Remove the tunable. Adding or removing a task from a colocation group is
still possible and so we're not losing functionality.

Change-Id: I4cb8bcdbee98d3bdd168baacbac345eca9ea8879
Signed-off-by: Vikram Mulukutla <markivx@codeaurora.org>
2017-01-18 09:45:44 -08:00
Vikram Mulukutla
2768f0352b sched: hmp: Ensure that best_cluster() never returns NULL
There are certain conditions under which group_will_fit() may return 0 for
all clusters in the system, especially under changing thermal conditions.
This may result in crashes such as this one:

        CPU 0                    |               CPU 1
====================================================================
select_best_cpu()                |
 -> env.rtg = rtgA               |
    rtgA.pref_cluster=C_big      |
                                 |   set_pref_cluster() for rtgA
                                 |     -> best_cluster()
                                 |        C_little doesn't fit
                                 |
                                 |   IRQ: thermal mitigation
                                 |   C_big capacity now less
                                 |   than C_little capacity
                                 |
                                 |     -> best_cluster() continues
                                 |        C_big doesn't fit
                                 |   set_pref_cluster() sets
                                 |   rtgA.pref_cluster = NULL
                                 |
select_least_power_cluster()     |
  -> cluster_first_cpu()         |
     -> BUG()                    |

Adding lock protection around accesses to the group's preferred cluster
would be expensive and would defeat the point of using RCU to protect
access to the related_thread_group structure. Therefore, ensure that
best_cluster() can never return NULL. In the worst case, we'll select the
wrong cluster for a related_thread_group's demand, but this should be
corrected at the next tick or wakeup. Locking would still have led to the
momentarily wrong decision, with the additional expense!

Also, don't set preferred cluster to NULL when colocation is disabled.

Change-Id: Id3f514b149add9b3ed33d104fa6a9bd57bec27e2
Signed-off-by: Vikram Mulukutla <markivx@codeaurora.org>
2017-01-18 09:45:40 -08:00
Pavankumar Kondeti
f4f127a9ba tracing: Use SOFTIRQ_OFFSET for softirq detection for more accurate results
The 's' flag is supposed to indicate that a softirq is running. This
can be detected by testing the preempt_count with SOFTIRQ_OFFSET.

The current code tests the preempt_count with SOFTIRQ_MASK, which
would be true even when softirqs are disabled but not serving a
softirq.
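
The distinction, sketched with the standard preempt count masks (the
actual change is in how the tracing flag is derived):

    bool bh_disabled_or_serving = preempt_count() & SOFTIRQ_MASK;    /* also true under plain local_bh_disable() */
    bool serving_softirq        = preempt_count() & SOFTIRQ_OFFSET;  /* true only while a softirq handler runs */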

Link: http://lkml.kernel.org/r/1481300417-3564-1-git-send-email-pkondeti@codeaurora.org

Change-Id: I084531ce806e0f7d42a38be0a7ad45977c43d158
Signed-off-by: Pavankumar Kondeti <pkondeti@codeaurora.org>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
Git-commit: c59f29cb144a6a0dfac16ede9dc8eafc02dc56ca
Git-repo: git://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git
2017-01-18 07:05:59 +05:30
Linux Build Service Account
8f796906de Merge "workqueue: fix possible livelock with concurrent mod_delayed_work()" 2017-01-17 17:18:16 -08:00
Linux Build Service Account
1e5081e1b2 Merge "sched: Initialize variables" 2017-01-16 04:29:07 -08:00
Linux Build Service Account
a963750b83 Merge "sched: Fix compilation errors when CFS_BANDWIDTH && !SCHED_HMP" 2017-01-16 04:29:04 -08:00
Linux Build Service Account
a2dbdc2c6e Merge "perf: don't leave group_entry on sibling list (use-after-free)" 2017-01-16 04:29:00 -08:00
Brendan Jackman
ee620ddd65 DEBUG: sched/fair: Fix sched_load_avg_cpu events for task_groups
The current sched_load_avg_cpu event traces the load for any cfs_rq that is
updated. This is not representative of the CPU load - instead we should only
trace this event when the cfs_rq being updated is in the root_task_group.
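
A sketch of the intended condition (assuming the event takes the CPU and
its cfs_rq; the exact tracepoint signature may differ):

    if (cfs_rq->tg == &root_task_group)
        trace_sched_load_avg_cpu(cpu_of(rq_of(cfs_rq)), cfs_rq);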

Change-Id: I345c2f13f6b5718cb4a89beb247f7887ce97ed6b
Signed-off-by: Brendan Jackman <brendan.jackman@arm.com>
2017-01-16 15:03:08 +05:30
Brendan Jackman
52a2ef75c3 DEBUG: sched/fair: Fix missing sched_load_avg_cpu events
update_cfs_rq_load_avg is called from update_blocked_averages without triggering
the sched_load_avg_cpu event. Move the event trigger to inside
update_cfs_rq_load_avg to avoid this missing event.

Change-Id: I6c4f66f687a644e4e7f798db122d28a8f5919b7b
Signed-off-by: Brendan Jackman <brendan.jackman@arm.com>
2017-01-16 15:03:08 +05:30
Linux Build Service Account
a1e7739089 Merge "sched: fix a bug in handling top task table rollover" 2017-01-14 03:42:58 -08:00
Olav Haugan
68b55fe985 sched: Initialize variables
Initialize the variable at its definition to avoid a compiler warning
when compiling with CONFIG_OPTIMIZE_FOR_SIZE=n.

Change-Id: Ibd201877b2274c70ced9d7240d0e527bc77402f3
Signed-off-by: Olav Haugan <ohaugan@codeaurora.org>
2017-01-13 17:06:01 -08:00
Linux Build Service Account
1896b200f9 Merge "perf: protect group_leader from races that cause ctx double-free" 2017-01-13 17:02:47 -08:00
Alex Shi
e30546378e Merge branch 'linux-linaro-lsk-v4.4' into linux-linaro-lsk-v4.4-android 2017-01-13 12:01:52 +08:00
Alex Shi
99d4c5fe0b Merge tag 'v4.4.42' into linux-linaro-lsk-v4.4
This is the 4.4.42 stable release
2017-01-13 12:01:45 +08:00
Kees Cook
22562e0cec sysctl: enable strict writes
SYSCTL_WRITES_WARN was added in commit f4aacea2f5 ("sysctl: allow for
strict write position handling"), and released in v3.16 in August of
2014.  Since then I can find only 1 instance of non-zero offset
writing[1], and it was fixed immediately in CRIU[2].  As such, it
appears safe to flip this to the strict state now.

[1] https://www.google.com/search?q="when%20file%20position%20was%20not%200"
[2] http://lists.openvz.org/pipermail/criu/2015-April/019819.html
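
The flip itself amounts to a one-line default change in kernel/sysctl.c
(sketched; the symbolic names come from the commit referenced above):

    -static int sysctl_writes_strict = SYSCTL_WRITES_WARN;
    +static int sysctl_writes_strict = SYSCTL_WRITES_STRICT;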

Change-Id: Ibf8d46fa34fa9fd4df3527dc4dfc3e3d31b2f7e0
Signed-off-by: Kees Cook <keescook@chromium.org>
Cc: "Eric W. Biederman" <ebiederm@xmission.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Git-commit: 41662f5cc55335807d39404371cfcbb1909304c4
Git-repo: git://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git
Signed-off-by: Olav Haugan <ohaugan@codeaurora.org>
2017-01-12 16:01:51 -08:00
Thomas Gleixner
6053479cbb tick/broadcast: Prevent NULL pointer dereference
commit c1a9eeb938b5433947e5ea22f89baff3182e7075 upstream.

When a dysfunctional timer, e.g. a dummy timer, is installed, the tick core
tries to set up the broadcast timer.

If no broadcast device is installed, the kernel crashes with a NULL pointer
dereference in tick_broadcast_setup_oneshot() because the function has no
sanity check.
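
The sanity check amounts to bailing out early when no broadcast device
exists; a sketch of the guard:

    /* at the top of tick_broadcast_setup_oneshot() */
    if (!bc)
        return;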

Reported-by: Mason <slash.tmp@free.fr>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: Mark Rutland <mark.rutland@arm.com>
Cc: Anna-Maria Gleixner <anna-maria@linutronix.de>
Cc: Richard Cochran <rcochran@linutronix.de>
Cc: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Cc: Daniel Lezcano <daniel.lezcano@linaro.org>
Cc: Peter Zijlstra <peterz@infradead.org>,
Cc: Sebastian Frias <sf84@laposte.net>
Cc: Thibaud Cornic <thibaud_cornic@sigmadesigns.com>
Cc: Robin Murphy <robin.murphy@arm.com>
Link: http://lkml.kernel.org/r/1147ef90-7877-e4d2-bb2b-5c4fa8d3144b@free.fr
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2017-01-12 11:22:51 +01:00
Arnd Bergmann
56ef587b77 stable-fixup: hotplug: fix unused function warning
[resolves a messed up backport, so no matching upstream commit]

The backport of upstream commit 777c6e0daebb ("hotplug: Make
register and unregister notifier API symmetric") to linux-4.4.y
introduced a harmless warning in 'allnoconfig' builds as spotted by
kernelci.org:

kernel/cpu.c:226:13: warning: 'cpu_notify_nofail' defined but not used [-Wunused-function]

So far, this is the only stable tree that is affected, as linux-4.6 and
higher contain commit 984581728eb4 ("cpu/hotplug: Split out cpu down functions")
that makes the function used in all configurations, while older longterm
releases so far don't seem to have a backport of 777c6e0daebb.

The fix for the warning is trivial: move the unused function back
into the #ifdef section where it was before.
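
The shape of the fix, sketched against the 4.4-era helpers (the function
moves under CONFIG_HOTPLUG_CPU so allnoconfig builds no longer see an
unused definition):

    #ifdef CONFIG_HOTPLUG_CPU
    static void cpu_notify_nofail(unsigned long val, void *v)
    {
        BUG_ON(cpu_notify(val, v));
    }
    /* ... remaining hotplug-only code ... */
    #endif /* CONFIG_HOTPLUG_CPU */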

Link: https://kernelci.org/build/id/586fcacb59b514049ef6c3aa/logs/
Fixes: 1c0f4e0ebb ("hotplug: Make register and unregister notifier API symmetric") in v4.4.y
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2017-01-12 11:22:48 +01:00
Pavankumar Kondeti
f6471c2c9d sched: Fix compilation errors when CFS_BANDWIDTH && !SCHED_HMP
There are a few compiler errors and warnings when the CFS_BANDWIDTH
config is enabled but SCHED_HMP is not.

Change-Id: Idaf4a7364564b6faf56df2eb3a1a74eeb242d57e
Signed-off-by: Pavankumar Kondeti <pkondeti@codeaurora.org>
2017-01-12 08:43:45 +05:30
Pavankumar Kondeti
f1a15235d6 sched: fix compiler errors with !SCHED_HMP
HMP scheduler boost related functions are referenced in the SMP
load balancer. Add nop versions of these functions to fix the
compiler errors with !SCHED_HMP.

Change-Id: I1cbcf67f728c2cbc7c0f47e8eaf1f4165649dce8
Signed-off-by: Pavankumar Kondeti <pkondeti@codeaurora.org>
2017-01-12 08:43:45 +05:30
Pavankumar Kondeti
d1f21a7e9c workqueue: fix possible livelock with concurrent mod_delayed_work()
When mod_delayed_work() is executed concurrently, there is a potential
livelock scenario due to pool->lock contention.

Let's say both CPU#0 and CPU#4 call mod_delayed_work() on the same
work item with 0 delay on a bound workqueue. This work item has
previously run on CPU#4. CPU#0 wins the work item PENDING bit race
and proceeds to queueing. As this work has previously run on CPU#4,
it tries to acquire the corresponding pool->lock to check whether it
is still running there. In the meantime, CPU#4 loops in
try_to_grab_pending() waiting for the work item to be linked with a
pwq so that it can steal it from pwq->pool->worklist. CPU#4
essentially acquires and releases the pool->lock in a busy loop and
CPU#0 may never get this lock.

----------------                        --------------------
    CPU#0                                          CPU#4
---------------                         --------------------

blk_run_queue_async()

mod_delayed_work_on()                  queue_unplugged()

--> try_to_grab_pending() returns      blk_run_queue_async()
0 indicating PENDING bit is set
now.

__queue_delayed_work()                 mod_delayed_work_on()

__queue_work()                         try_to_grab_pending()

                                       {

--> waiting for the CPU#4's               acquire pool->lock()
pool->lock                                release pool->lock()

                                       }
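
For reference, the block layer path in the diagram boils down to both CPUs
re-queueing the same delayed work with zero delay (a sketch of the
triggering call, not of the workqueue-internal fix):

    /* blk_run_queue_async() effectively does: */
    mod_delayed_work(kblockd_workqueue, &q->delay_work, 0);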

Change-Id: I9aeab111f55a19478a9d045c8e3576bce3b7a7c5
Signed-off-by: Pavankumar Kondeti <pkondeti@codeaurora.org>
2017-01-11 11:53:39 +05:30
John Dias
5145fb9b00 perf: don't leave group_entry on sibling list (use-after-free)
When perf_group_detach is called on a group leader,
it should empty its sibling list. Otherwise, when
a sibling is later deallocated, list_del_event()
removes the sibling's group_entry from its current
list, which can be the now-deallocated group leader's
sibling list (use-after-free bug).

Bug: 32402548
Change-Id: I99f6bc97c8518df1cb0035814368012ba72ab1f1
Signed-off-by: John Dias <joaodias@google.com>
Git-repo: https://android.googlesource.com/kernel/msm
Git-commit: 6b6cfb2362f09553b46b3b7e5684b16b6e53e373
Signed-off-by: Dennis Cagle <d-cagle@codeaurora.org>
2017-01-10 14:18:20 -08:00
Syed Rameez Mustafa
47f7e0415a sched: Convert the global wake_up_idle flag to a per cluster flag
Since clusters can vary significantly in their power and performance
characteristics, there may be a need for different CPU selection
policies based on which cluster a task is being placed on. For example,
the placement policy can be more aggressive in using idle CPUs on
clusters that are power efficient and less aggressive on clusters
that are geared towards performance. Add support for a per-cluster
wake_up_idle flag to allow greater flexibility in placement policies.

Change-Id: I18cd3d907cd965db03a13f4655870dc10c07acfe
Signed-off-by: Syed Rameez Mustafa <rameezmustafa@codeaurora.org>
2017-01-10 11:01:52 -08:00
Alex Shi
7785301d92 Merge branch 'linux-linaro-lsk-v4.4' into linux-linaro-lsk-v4.4-android 2017-01-10 12:01:14 +08:00
Alex Shi
f02e043c5e Merge tag 'v4.4.41' into linux-linaro-lsk-v4.4
This is the 4.4.41 stable release
2017-01-10 12:01:08 +08:00
Linux Build Service Account
77fadbc47f Merge "sched: fix stale predicted load in trace_sched_get_busy()" 2017-01-09 12:42:38 -08:00
Steven Rostedt (Red Hat)
e945df4c6b fgraph: Handle a case where a tracer ignores set_graph_notrace
commit 794de08a16cf1fc1bf785dc48f66d36218cf6d88 upstream.

Both the wakeup and irqsoff tracers can use the function graph tracer when
the display-graph option is set. The problem is that they ignore the notrace
file, and record the entry of functions that would be ignored by the
function_graph tracer. This causes the trace->depth to be recorded into the
ring buffer. The set_graph_notrace uses a trick by adding a large negative
number to the trace->depth when a graph function is to be ignored.

On trace output, the graph function uses the depth to record a stack of
functions. But since the depth is negative, it accesses the array with a
negative number and causes an out of bounds access that can cause a kernel
oops or corrupt data.

Have the print functions handle cases where a tracer still records functions
even when they are in set_graph_notrace.

Also add warnings if the depth is below zero before accessing the array.

Note, the function graph logic will still prevent the return of these
functions from being recorded, which means that they will be left hanging
without a return. For example:

   # echo '*spin*' > set_graph_notrace
   # echo 1 > options/display-graph
   # echo wakeup > current_tracer
   # cat trace
   [...]
      _raw_spin_lock() {
        preempt_count_add() {
        do_raw_spin_lock() {
      update_rq_clock();

Where it should look like:

      _raw_spin_lock() {
        preempt_count_add();
        do_raw_spin_lock();
      }
      update_rq_clock();

Cc: Namhyung Kim <namhyung.kim@lge.com>
Fixes: 29ad23b004 ("ftrace: Add set_graph_notrace filter")
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2017-01-09 08:07:50 +01:00
Thomas Gleixner
e01b04be3e timekeeping: Force unsigned clocksource to nanoseconds conversion
commit 9c1645727b8fa90d07256fdfcc45bf831242a3ab upstream.

The clocksource delta to nanoseconds conversion is using signed math, but
the delta is unsigned. This makes the conversion space smaller than
necessary and in case of a multiplication overflow the conversion can
become negative. The conversion is done with scaled math:

    s64 nsec_delta = ((s64)clkdelta * clk->mult) >> clk->shift;

Shifting a signed integer right obviously preserves the sign, which has
interesting consequences:

 - Time jumps backwards

 - __iter_div_u64_rem() which is used in one of the calling code paths
   will take forever to piecewise calculate the seconds/nanoseconds part.

This has been reported by several people with different scenarios:

David observed that when stopping a VM with a debugger:

 "It was essentially the stopped by debugger case.  I forget exactly why,
  but the guest was being explicitly stopped from outside, it wasn't just
  scheduling lag.  I think it was something in the vicinity of 10 minutes
  stopped."

 When lifting the stop the machine went dead.

The stopped by debugger case is not really interesting, but nevertheless it
would be a good thing not to die completely.

But this was also observed on a live system by Liav:

 "When the OS is too overloaded, delta will get a high enough value for the
  msb of the sum delta * tkr->mult + tkr->xtime_nsec to be set, and so
  after the shift the nsec variable will gain a value similar to
  0xffffffffff000000."

Unfortunately this has been reintroduced recently with commit 6bd58f09e1d8
("time: Add cycles to nanoseconds translation"). It had been fixed a year
ago already in commit 35a4933a8959 ("time: Avoid signed overflow in
timekeeping_get_ns()").

Though it's not surprising that the issue has been reintroduced because the
function itself and the whole call chain uses s64 for the result and the
propagation of it. The change in this recent commit is subtle:

   s64 nsec;

-  nsec = (d * m + n) >> s:
+  nsec = d * m + n;
+  nsec >>= s;

d being type of cycle_t adds another level of obfuscation.

This wouldn't have happened if the previous change to unsigned computation
had made the 'nsec' variable u64 right away and a follow-up patch had
cleaned up the whole call chain.

There have been patches submitted which basically did a revert of the above
patch leaving everything else unchanged as signed. Back to square one. This
spawned an admittedly pointless discussion about potential users which rely
on the unsigned behaviour until someone pointed out that it had been fixed
before. The changelogs of said patches added further confusion as they made
finally false claims about the consequences for eventual users which expect
signed results.

Despite delta being cycle_t, aka. u64, it's very well possible to hand in
a signed negative value and the signed computation will happily return the
correct result. But nobody actually sat down and analyzed the code which
was added as a user after the probably unintended signed conversion.

Though in sensitive code like this it's better to analyze it properly and
make sure that nothing relies on this than hunting the subtle wreckage half
a year later. After analyzing all call chains it stands that no caller can
hand in a negative value (which actually would work due to the s64 cast)
and rely on the signed math to do the right thing.

Change the conversion function to unsigned math. The conversion of all call
chains is done in a follow up patch.
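
The end state is a conversion carried out entirely in unsigned arithmetic;
a sketch of the resulting helper (field names as in struct tk_read_base;
the arch timeoffset handling is omitted):

    static inline u64 timekeeping_delta_to_ns(struct tk_read_base *tkr, cycle_t delta)
    {
        u64 nsec;

        nsec = delta * tkr->mult + tkr->xtime_nsec;
        nsec >>= tkr->shift;

        return nsec;
    }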

This solves the starvation issue, which was caused by the negative result,
but it does not solve the underlying problem. It merely postpones
it. When the timekeeper update is deferred long enough that the unsigned
multiplication overflows, then time going backwards is observable again.

It also does not solve the issue of clocksources with a small counter
width, which will wrap around possibly several times and cause random
time stamps to be generated. But those are usually not found on systems
used for virtualization, so this is likely a non-issue.

I took the liberty to claim authorship for this simply because
analyzing all callsites and writing the changelog took substantially
more time than just making the simple s/s64/u64/ change and ignoring the
rest.

Fixes: 6bd58f09e1d8 ("time: Add cycles to nanoseconds translation")
Reported-by: David Gibson <david@gibson.dropbear.id.au>
Reported-by: Liav Rehana <liavr@mellanox.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Prarit Bhargava <prarit@redhat.com>
Cc: Laurent Vivier <lvivier@redhat.com>
Cc: "Christopher S. Hall" <christopher.s.hall@intel.com>
Cc: Chris Metcalf <cmetcalf@mellanox.com>
Cc: Richard Cochran <richardcochran@gmail.com>
Cc: John Stultz <john.stultz@linaro.org>
Link: http://lkml.kernel.org/r/20161208204228.688545601@linutronix.de
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2017-01-09 08:07:43 +01:00
Alex Shi
19192a140a Merge branch 'linux-linaro-lsk-v4.4' into linux-linaro-lsk-v4.4-android 2017-01-09 12:01:35 +08:00
Alex Shi
eaa88578f2 Merge tag 'v4.4.40' into linux-linaro-lsk-v4.4
This is the 4.4.40 stable release
2017-01-09 12:01:31 +08:00
Pavankumar Kondeti
add97fe0da sched: fix a bug in handling top task table rollover
When frequency aggregation is enabled, there is a possibility of
rolling over the top task table multiple times in a single
window.

For example

- utra() is called with PUT_PREV_TASK for task 'A', which does not
belong to any related thread grp. Let's say a window rollover happens;
the rq counters and the top task table are rolled over.

- utra() is called with PICK_NEXT_TASK/TASK_WAKE for task 'B', which
belongs to a related thread grp. Let's say this happens before
the grp's cpu_time->window_start is in sync with rq->window_start.
In this case, the grp's cpu_time counters are rolled over and the
top task table is also rolled over again.

Roll over the top task table in the context of the currently running
task to fix this.

Change-Id: Iea3075e0ea460a9279a01ba42725890c46edd713
Signed-off-by: Pavankumar Kondeti <pkondeti@codeaurora.org>
2017-01-07 13:50:04 +05:30
Pavankumar Kondeti
432662eb4d sched: fix stale predicted load in trace_sched_get_busy()
When an early detection notification is pending, we skip calculating
the predicted load. Initialize it to 0 so that a stale value does not
get printed in trace_sched_get_busy().

Change-Id: I36287c0081f6c12191235104666172b7cae2a583
Signed-off-by: Pavankumar Kondeti <pkondeti@codeaurora.org>
2017-01-07 13:49:22 +05:30
Douglas Anderson
f93777c915 kernel/debug/debug_core.c: more properly delay for secondary CPUs
commit 2d13bb6494c807bcf3f78af0e96c0b8615a94385 upstream.

We've got a delay loop waiting for secondary CPUs.  That loop uses
loops_per_jiffy.  However, loops_per_jiffy doesn't actually mean how
many tight loops make up a jiffy on all architectures.  It is quite
common to see things like this in the boot log:

  Calibrating delay loop (skipped), value calculated using timer
  frequency.. 48.00 BogoMIPS (lpj=24000)

In my case I was seeing lots of cases where other CPUs timed out
entering the debugger only to print their stack crawls shortly after the
kdb> prompt was written.

Elsewhere in kgdb we already use udelay(), so that should be safe enough
to use to implement our timeout.  We'll delay 1 ms for 1000 times, which
should give us a full second of delay (just like the old code wanted)
but allow us to notice that we're done every 1 ms.
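
The replacement loop has roughly this shape (a sketch; the condition
deciding when all CPUs have entered the debugger is simplified to an
illustrative helper):

    int time_left = MSEC_PER_SEC;          /* ~1 second total */

    while (--time_left && !all_secondary_cpus_entered())  /* helper name is illustrative */
        udelay(1000);                      /* re-check every 1 ms */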

[akpm@linux-foundation.org: simplifications, per Daniel]
Link: http://lkml.kernel.org/r/1477091361-2039-1-git-send-email-dianders@chromium.org
Signed-off-by: Douglas Anderson <dianders@chromium.org>
Reviewed-by: Daniel Thompson <daniel.thompson@linaro.org>
Cc: Jason Wessel <jason.wessel@windriver.com>
Cc: Brian Norris <briannorris@chromium.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2017-01-06 11:16:16 +01:00