From 4108b3d96273784f697dd6d8e59ef9203a10a02d Mon Sep 17 00:00:00 2001 From: Len Brown Date: Tue, 16 Dec 2014 01:52:06 -0500 Subject: [PATCH 1/5] cpuidle: menu: Better idle duration measurement without using CPUIDLE_FLAG_TIME_INVALID When menu sees CPUIDLE_FLAG_TIME_INVALID, it ignores its timestamps, and assumes that idle lasted as long as the time till next predicted timer expiration. But if an interrupt was seen and serviced before that duration, it would actually be more accurate to use the measured time rather than rounding up to the next predicted timer expiration. And if an interrupt is seen and serviced such that the mesured time exceeds the time till next predicted timer expiration, then truncating to that expiration is the right thing to do -- since we can never stay idle past that timer expiration. So the code can do a better job without checking for CPUIDLE_FLAG_TIME_INVALID. Signed-off-by: Len Brown Acked-by: Daniel Lezcano Reviewed-by: Tuukka Tikkanen Signed-off-by: Rafael J. Wysocki --- drivers/cpuidle/governors/menu.c | 29 ++++++++++++----------------- 1 file changed, 12 insertions(+), 17 deletions(-) diff --git a/drivers/cpuidle/governors/menu.c b/drivers/cpuidle/governors/menu.c index 659d7b0c9ebf..40580794e23d 100644 --- a/drivers/cpuidle/governors/menu.c +++ b/drivers/cpuidle/governors/menu.c @@ -396,8 +396,8 @@ static void menu_update(struct cpuidle_driver *drv, struct cpuidle_device *dev) * power state and occurrence of the wakeup event. * * If the entered idle state didn't support residency measurements, - * we are basically lost in the dark how much time passed. - * As a compromise, assume we slept for the whole expected time. + * we use them anyway if they are short, and if long, + * truncate to the whole expected time. * * Any measured amount of time will include the exit latency. * Since we are interested in when the wakeup begun, not when it @@ -405,23 +405,18 @@ static void menu_update(struct cpuidle_driver *drv, struct cpuidle_device *dev) * the measured amount of time is less than the exit latency, * assume the state was never reached and the exit latency is 0. */ - if (unlikely(target->flags & CPUIDLE_FLAG_TIME_INVALID)) { - /* Use timer value as is */ + + /* measured value */ + measured_us = cpuidle_get_last_residency(dev); + + /* Deduct exit latency */ + if (measured_us > target->exit_latency) + measured_us -= target->exit_latency; + + /* Make sure our coefficients do not exceed unity */ + if (measured_us > data->next_timer_us) measured_us = data->next_timer_us; - } else { - /* Use measured value */ - measured_us = cpuidle_get_last_residency(dev); - - /* Deduct exit latency */ - if (measured_us > target->exit_latency) - measured_us -= target->exit_latency; - - /* Make sure our coefficients do not exceed unity */ - if (measured_us > data->next_timer_us) - measured_us = data->next_timer_us; - } - /* Update our correction ratio */ new_factor = data->correction_factor[data->bucket]; new_factor -= new_factor / DECAY; From b73026b9c959600bcd65eeae7a5f7ac00ded886f Mon Sep 17 00:00:00 2001 From: Len Brown Date: Tue, 16 Dec 2014 01:52:07 -0500 Subject: [PATCH 2/5] cpuidle: ladder: Better idle duration measurement without using CPUIDLE_FLAG_TIME_INVALID When the ladder governor sees the CPUIDLE_FLAG_TIME_INVALID flag, it unconditionally causes a state promotion by setting last_residency to a number higher than the state's promotion_time: last_residency = last_state->threshold.promotion_time + 1 It does this for fear that cpuidle_get_last_residency() will be in-accurate, because cpuidle_enter_state() invoked a state with CPUIDLE_FLAG_TIME_INVALID. But the only state with CPUIDLE_FLAG_TIME_INVALID is acpi_safe_halt(), which may return well after its actual idle duration because it enables interrupts, so cpuidle_enter_state() also measures interrupt service time. So what? In ladder, a huge invalid last_residency has exactly the same effect as the current code -- it unconditionally causes a state promotion. In the case where the idle residency plus measured interrupt handling time is less than the state's demotion_time -- we should use that timestamp to give ladder a chance to demote, rather than unconditionally promoting. This can be done by simply ignoring the CPUIDLE_FLAG_TIME_INVALID, and using the "invalid" time, as it is either equal to what we are doing today, or better. Signed-off-by: Len Brown Acked-by: Daniel Lezcano Signed-off-by: Rafael J. Wysocki --- drivers/cpuidle/governors/ladder.c | 7 +------ 1 file changed, 1 insertion(+), 6 deletions(-) diff --git a/drivers/cpuidle/governors/ladder.c b/drivers/cpuidle/governors/ladder.c index 37263d9a1051..401c0106ed34 100644 --- a/drivers/cpuidle/governors/ladder.c +++ b/drivers/cpuidle/governors/ladder.c @@ -79,12 +79,7 @@ static int ladder_select_state(struct cpuidle_driver *drv, last_state = &ldev->states[last_idx]; - if (!(drv->states[last_idx].flags & CPUIDLE_FLAG_TIME_INVALID)) { - last_residency = cpuidle_get_last_residency(dev) - \ - drv->states[last_idx].exit_latency; - } - else - last_residency = last_state->threshold.promotion_time + 1; + last_residency = cpuidle_get_last_residency(dev) - drv->states[last_idx].exit_latency; /* consider promotion */ if (last_idx < drv->state_count - 1 && From 62c4cf97e82cf79446642e599d155884f600cf17 Mon Sep 17 00:00:00 2001 From: Len Brown Date: Tue, 16 Dec 2014 01:52:08 -0500 Subject: [PATCH 3/5] cpuidle / ACPI: remove unused CPUIDLE_FLAG_TIME_INVALID CPUIDLE_FLAG_TIME_INVALID is no longer checked by menu or ladder cpuidle governors, so don't bother setting or defining it. It was originally invented to account for the fact that acpi_safe_halt() enables interrupts to invoke HLT. That would allow interrupt service routines to be included in the last_idle duration measurements made in cpuidle_enter_state(), potentially returning a duration much larger than reality. But menu and ladder can gracefully handle erroneously large duration intervals without checking for CPUIDLE_FLAG_TIME_INVALID. Further, if they don't check CPUIDLE_FLAG_TIME_INVALID, they can also benefit from the instances when the duration interval is not erroneously large. Signed-off-by: Len Brown Acked-by: Daniel Lezcano Signed-off-by: Rafael J. Wysocki --- drivers/acpi/processor_idle.c | 2 -- include/linux/cpuidle.h | 3 --- 2 files changed, 5 deletions(-) diff --git a/drivers/acpi/processor_idle.c b/drivers/acpi/processor_idle.c index 380b4b43f361..7afba40e350e 100644 --- a/drivers/acpi/processor_idle.c +++ b/drivers/acpi/processor_idle.c @@ -985,8 +985,6 @@ static int acpi_processor_setup_cpuidle_states(struct acpi_processor *pr) state->flags = 0; switch (cx->type) { case ACPI_STATE_C1: - if (cx->entry_method != ACPI_CSTATE_FFH) - state->flags |= CPUIDLE_FLAG_TIME_INVALID; state->enter = acpi_idle_enter_c1; state->enter_dead = acpi_idle_play_dead; diff --git a/include/linux/cpuidle.h b/include/linux/cpuidle.h index a07e087f54b2..ab70f3bc44ad 100644 --- a/include/linux/cpuidle.h +++ b/include/linux/cpuidle.h @@ -53,7 +53,6 @@ struct cpuidle_state { }; /* Idle State Flags */ -#define CPUIDLE_FLAG_TIME_INVALID (0x01) /* is residency time measurable? */ #define CPUIDLE_FLAG_COUPLED (0x02) /* state applies to multiple cpus */ #define CPUIDLE_FLAG_TIMER_STOP (0x04) /* timer is stopped on this state */ @@ -89,8 +88,6 @@ DECLARE_PER_CPU(struct cpuidle_device, cpuidle_dev); /** * cpuidle_get_last_residency - retrieves the last state's residency time * @dev: the target CPU - * - * NOTE: this value is invalid if CPUIDLE_FLAG_TIME_INVALID is set */ static inline int cpuidle_get_last_residency(struct cpuidle_device *dev) { From 62a041a4f58f32989460e37cb4f9aed5183f357f Mon Sep 17 00:00:00 2001 From: Dmitry Torokhov Date: Tue, 16 Dec 2014 15:09:39 -0800 Subject: [PATCH 4/5] cpufreq-dt: defer probing if OPP table is not ready cpufreq-dt driver supports mode when OPP table is provided by platform code and not device tree. However on certain platforms code that fills OPP table may run after cpufreq driver tries to initialize, so let's report -EPROBE_DEFER if we do not find any entires in OPP table for the CPU. Signed-off-by: Dmitry Torokhov Acked-by: Viresh Kumar Signed-off-by: Rafael J. Wysocki --- drivers/cpufreq/cpufreq-dt.c | 11 +++++++++++ 1 file changed, 11 insertions(+) diff --git a/drivers/cpufreq/cpufreq-dt.c b/drivers/cpufreq/cpufreq-dt.c index 9bc2720628a4..538abd50b77c 100644 --- a/drivers/cpufreq/cpufreq-dt.c +++ b/drivers/cpufreq/cpufreq-dt.c @@ -211,6 +211,17 @@ static int cpufreq_init(struct cpufreq_policy *policy) /* OPPs might be populated at runtime, don't check for error here */ of_init_opp_table(cpu_dev); + /* + * But we need OPP table to function so if it is not there let's + * give platform code chance to provide it for us. + */ + ret = dev_pm_opp_get_opp_count(cpu_dev); + if (ret <= 0) { + pr_debug("OPP table is not ready, deferring probe\n"); + ret = -EPROBE_DEFER; + goto out_free_opp; + } + priv = kzalloc(sizeof(*priv), GFP_KERNEL); if (!priv) { ret = -ENOMEM; From cb57720bf79688d64854a0a43565aa52303c1f3f Mon Sep 17 00:00:00 2001 From: Ethan Zhao Date: Thu, 18 Dec 2014 15:28:19 +0900 Subject: [PATCH 5/5] cpufreq: fix a NULL pointer dereference in __cpufreq_governor() If ACPI _PPC changed notification happens before governor was initiated while kernel is booting, a NULL pointer dereference will be triggered: BUG: unable to handle kernel NULL pointer dereference at 0000000000000030 IP: [] __cpufreq_governor+0x23/0x1e0 PGD 0 Oops: 0000 [#1] SMP ... ... RIP: 0010:[] [] __cpufreq_governor+0x23/0x1e0 RSP: 0018:ffff881fcfbcfbb8 EFLAGS: 00010286 RAX: 0000000000000000 RBX: ffff881fd11b3980 RCX: ffff88407fc20000 RDX: 0000000000000000 RSI: 0000000000000000 RDI: ffff881fd11b3980 RBP: ffff881fcfbcfbd8 R08: 0000000000000000 R09: 000000000000000f R10: ffffffff818068d0 R11: 0000000000000043 R12: 0000000000000004 R13: 0000000000000000 R14: ffffffff8196cae0 R15: 0000000000000000 FS: 0000000000000000(0000) GS:ffff881fffc00000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: 0000000000000030 CR3: 00000000018ae000 CR4: 00000000000407f0 DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400 Process kworker/0:3 (pid: 750, threadinfo ffff881fcfbce000, task ffff881fcf556400) Stack: ffff881fffc17d00 ffff881fcfbcfc18 ffff881fd11b3980 0000000000000000 ffff881fcfbcfc08 ffffffff81470d08 ffff881fd11b3980 0000000000000007 ffff881fcfbcfc18 ffff881fffc17d00 ffff881fcfbcfd28 ffffffff81472e9a Call Trace: [] __cpufreq_set_policy+0x1b8/0x2e0 [] cpufreq_update_policy+0xca/0x150 [] ? cpufreq_update_policy+0x150/0x150 [] acpi_processor_ppc_has_changed+0x71/0x7b [] acpi_processor_notify+0x55/0x115 [] acpi_device_notify+0x19/0x1b [] acpi_ev_notify_dispatch+0x41/0x5f [] acpi_os_execute_deferred+0x27/0x34 The root cause is a race conditon -- cpufreq core and acpi-cpufreq driver were initiated, but cpufreq_governor wasn't and _PPC changed notification happened, __cpufreq_governor() was called within acpi_os_execute_deferred kernel thread context. To fix this panic issue, add pointer checking code in __cpufreq_governor() before pointer policy->governor is to be dereferenced. Signed-off-by: Ethan Zhao Acked-by: Viresh Kumar Signed-off-by: Rafael J. Wysocki --- drivers/cpufreq/cpufreq.c | 6 ++++++ 1 file changed, 6 insertions(+) diff --git a/drivers/cpufreq/cpufreq.c b/drivers/cpufreq/cpufreq.c index a09a29c312a9..46bed4f81cde 100644 --- a/drivers/cpufreq/cpufreq.c +++ b/drivers/cpufreq/cpufreq.c @@ -2028,6 +2028,12 @@ static int __cpufreq_governor(struct cpufreq_policy *policy, /* Don't start any governor operations if we are entering suspend */ if (cpufreq_suspended) return 0; + /* + * Governor might not be initiated here if ACPI _PPC changed + * notification happened, so check it. + */ + if (!policy->governor) + return -EINVAL; if (policy->governor->max_transition_latency && policy->cpuinfo.transition_latency >