Prevent __cpufreq_add_dev() from overwriting the existing values of
user_policy.{min|max|policy|governor} with defaults during resume
from system suspend.
Fixes: 5302c3fb2e ("cpufreq: Perform light-weight init/teardown during suspend/resume")
Reported-by: Bjørn Mork <bjorn@mork.no>
Signed-off-by: Viresh Kumar <viresh.kumar@linaro.org>
Cc: 3.12+ <stable@vger.kernel.org> # 3.12+
[rjw: Changelog]
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
If cpufreq_policy_restore() returns NULL during system resume,
__cpufreq_add_dev() should just fall back to the full initialization
instead of returning an error, because that may actually make things
work. Moreover, it should not leave stale fallback data behind after
it has failed to restore a previously existing policy.
This change is based on Viresh Kumar's work.
Fixes: 5302c3fb2e ("cpufreq: Perform light-weight init/teardown during suspend/resume")
Reported-by: Bjørn Mork <bjorn@mork.no>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Acked-by: Viresh Kumar <viresh.kumar@linaro.org>
Cc: 3.12+ <stable@vger.kernel.org> # 3.12+
When configuring a default governor (via CONFIG_CPU_FREQ_DEFAULT_*) with the
intel_pstate driver, the desired default policy is not properly set. For
example, setting 'CONFIG_CPU_FREQ_DEFAULT_GOV_PERFORMANCE' ends up with the
'powersave' policy being set.
Fix by configuring the correct default policy, if either 'powersave' or
'performance' are requested. Otherwise, fallback to what the driver originally
set via its 'init' routine.
Signed-off-by: Jason Baron <jbaron@akamai.com>
Acked-by: Viresh Kumar <viresh.kumar@linaro.org>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
There are cases where cpufreq_add_dev() may fail for some CPUs
during system resume. With the current code we will still have
sysfs cpufreq files for those CPUs and struct cpufreq_policy
would be already freed for them. Hence any operation on those
sysfs files would result in kernel warnings.
Example of problems resulting from resume errors (from Bjørn Mork):
WARNING: CPU: 0 PID: 6055 at fs/sysfs/file.c:343 sysfs_open_file+0x77/0x212()
missing sysfs attribute operations for kobject: (null)
Modules linked in: [stripped as irrelevant]
CPU: 0 PID: 6055 Comm: grep Tainted: G D 3.13.0-rc2 #153
Hardware name: LENOVO 2776LEG/2776LEG, BIOS 6EET55WW (3.15 ) 12/19/2011
0000000000000009 ffff8802327ebb78 ffffffff81380b0e 0000000000000006
ffff8802327ebbc8 ffff8802327ebbb8 ffffffff81038635 0000000000000000
ffffffff811823c7 ffff88021a19e688 ffff88021a19e688 ffff8802302f9310
Call Trace:
[<ffffffff81380b0e>] dump_stack+0x55/0x76
[<ffffffff81038635>] warn_slowpath_common+0x7c/0x96
[<ffffffff811823c7>] ? sysfs_open_file+0x77/0x212
[<ffffffff810386e3>] warn_slowpath_fmt+0x41/0x43
[<ffffffff81182dec>] ? sysfs_get_active+0x6b/0x82
[<ffffffff81182382>] ? sysfs_open_file+0x32/0x212
[<ffffffff811823c7>] sysfs_open_file+0x77/0x212
[<ffffffff81182350>] ? sysfs_schedule_callback+0x1ac/0x1ac
[<ffffffff81122562>] do_dentry_open+0x17c/0x257
[<ffffffff8112267e>] finish_open+0x41/0x4f
[<ffffffff81130225>] do_last+0x80c/0x9ba
[<ffffffff8112dbbd>] ? inode_permission+0x40/0x42
[<ffffffff81130606>] path_openat+0x233/0x4a1
[<ffffffff81130b7e>] do_filp_open+0x35/0x85
[<ffffffff8113b787>] ? __alloc_fd+0x172/0x184
[<ffffffff811232ea>] do_sys_open+0x6b/0xfa
[<ffffffff811233a7>] SyS_openat+0xf/0x11
[<ffffffff8138c812>] system_call_fastpath+0x16/0x1b
To fix this, remove those sysfs files or put the associated kobject
in case of such errors. Also, to make it simple, remove the cpufreq
sysfs links from all the CPUs (except for the policy->cpu) during
suspend, as that operation won't result in a loss of sysfs file
permissions and we can create those links during resume just fine.
Fixes: 5302c3fb2e ("cpufreq: Perform light-weight init/teardown during suspend/resume")
Reported-and-tested-by: Bjørn Mork <bjorn@mork.no>
Signed-off-by: Viresh Kumar <viresh.kumar@linaro.org>
Cc: 3.12+ <stable@vger.kernel.org> # 3.12+
[rjw: Changelog]
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Commit 2167e2399d (cpufreq: fix garbage kobjects on errors during
suspend/resume) breaks suspend/resume on Martin Ziegler's system
(hard lockup during resume), so revert it.
Fixes: 2167e2399d (cpufreq: fix garbage kobjects on errors during suspend/resume)
References: https://bugzilla.kernel.org/show_bug.cgi?id=66751
Reported-by: Martin Ziegler <ziegler@uni-freiburg.de>
Cc: 3.12+ <stable@vger.kernel.org> # 3.12+
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Commit 5a87182aa2 (cpufreq: suspend governors on system
suspend/hibernate) causes hibernation problems to happen on
Bjørn Mork's and Paul Bolle's systems, so revert it.
Fixes: 5a87182aa2 (cpufreq: suspend governors on system suspend/hibernate)
Reported-by: Bjørn Mork <bjorn@mork.no>
Reported-by: Paul Bolle <pebolle@tiscali.nl>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
This is effectively a revert of commit 5302c3fb2e ("cpufreq: Perform
light-weight init/teardown during suspend/resume"), which enabled
suspend/resume optimizations leaving the sysfs files in place.
Errors during suspend/resume are not handled properly, leaving
dead sysfs attributes in case of failures. There are are number of
functions with special code for the "frozen" case, and all these
need to also have special error handling.
The problem is easy to demonstrate by making cpufreq_driver->init()
or cpufreq_driver->get() fail during resume.
The code is too complex for a simple fix, with split code paths
in multiple blocks within a number of functions. It is therefore
best to revert the patch enabling this code until the error handling
is in place.
Examples of problems resulting from resume errors:
WARNING: CPU: 0 PID: 6055 at fs/sysfs/file.c:343 sysfs_open_file+0x77/0x212()
missing sysfs attribute operations for kobject: (null)
Modules linked in: [stripped as irrelevant]
CPU: 0 PID: 6055 Comm: grep Tainted: G D 3.13.0-rc2 #153
Hardware name: LENOVO 2776LEG/2776LEG, BIOS 6EET55WW (3.15 ) 12/19/2011
0000000000000009 ffff8802327ebb78 ffffffff81380b0e 0000000000000006
ffff8802327ebbc8 ffff8802327ebbb8 ffffffff81038635 0000000000000000
ffffffff811823c7 ffff88021a19e688 ffff88021a19e688 ffff8802302f9310
Call Trace:
[<ffffffff81380b0e>] dump_stack+0x55/0x76
[<ffffffff81038635>] warn_slowpath_common+0x7c/0x96
[<ffffffff811823c7>] ? sysfs_open_file+0x77/0x212
[<ffffffff810386e3>] warn_slowpath_fmt+0x41/0x43
[<ffffffff81182dec>] ? sysfs_get_active+0x6b/0x82
[<ffffffff81182382>] ? sysfs_open_file+0x32/0x212
[<ffffffff811823c7>] sysfs_open_file+0x77/0x212
[<ffffffff81182350>] ? sysfs_schedule_callback+0x1ac/0x1ac
[<ffffffff81122562>] do_dentry_open+0x17c/0x257
[<ffffffff8112267e>] finish_open+0x41/0x4f
[<ffffffff81130225>] do_last+0x80c/0x9ba
[<ffffffff8112dbbd>] ? inode_permission+0x40/0x42
[<ffffffff81130606>] path_openat+0x233/0x4a1
[<ffffffff81130b7e>] do_filp_open+0x35/0x85
[<ffffffff8113b787>] ? __alloc_fd+0x172/0x184
[<ffffffff811232ea>] do_sys_open+0x6b/0xfa
[<ffffffff811233a7>] SyS_openat+0xf/0x11
[<ffffffff8138c812>] system_call_fastpath+0x16/0x1b
The failure to restore cpufreq devices on cancelled hibernation is
not a new bug. It is caused by the ACPI _PPC call failing unless the
hibernate is completed. This makes the acpi_cpufreq driver fail its
init.
Previously, the cpufreq device could be restored by offlining the
cpu temporarily. And as a complete hibernation cycle would do this,
it would be automatically restored most of the time. But after
commit 5302c3fb2e the leftover sysfs attributes will block any
device add action. Therefore offlining and onlining CPU 1 will no
longer restore the cpufreq object, and a complete suspend/resume
cycle will replace it with garbage.
Fixes: 5302c3fb2e ("cpufreq: Perform light-weight init/teardown during suspend/resume")
Cc: 3.12+ <stable@vger.kernel.org> # 3.12+
Signed-off-by: Bjørn Mork <bjorn@mork.no>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
This patch adds cpufreq suspend/resume calls to dpm_{suspend|resume}_noirq()
for handling suspend/resume of cpufreq governors.
Lan Tianyu (Intel) & Jinhyuk Choi (Broadcom) found anr issue where
tunables configuration for clusters/sockets with non-boot CPUs was
getting lost after suspend/resume, as we were notifying governors
with CPUFREQ_GOV_POLICY_EXIT on removal of the last cpu for that
policy and so deallocating memory for tunables. This is fixed by
this patch as we don't allow any operation on governors after
device suspend and before device resume now.
Reported-and-tested-by: Lan Tianyu <tianyu.lan@intel.com>
Reported-by: Jinhyuk Choi <jinchoi@broadcom.com>
Signed-off-by: Viresh Kumar <viresh.kumar@linaro.org>
[rjw: Changelog, minor cleanups]
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Most of the drivers do following in their ->target_index() routines:
struct cpufreq_freqs freqs;
freqs.old = old freq...
freqs.new = new freq...
cpufreq_notify_transition(policy, &freqs, CPUFREQ_PRECHANGE);
/* Change rate here */
cpufreq_notify_transition(policy, &freqs, CPUFREQ_POSTCHANGE);
This is replicated over all cpufreq drivers today and there doesn't exists a
good enough reason why this shouldn't be moved to cpufreq core instead.
There are few special cases though, like exynos5440, which doesn't do everything
on the call to ->target_index() routine and call some kind of bottom halves for
doing this work, work/tasklet/etc..
They may continue doing notification from their own code as flag:
CPUFREQ_ASYNC_NOTIFICATION is already set for them.
All drivers are also modified in this patch to avoid breaking 'git bisect', as
double notification would happen otherwise.
Acked-by: Hans-Christian Egtvedt <egtvedt@samfundet.no>
Acked-by: Jesper Nilsson <jesper.nilsson@axis.com>
Acked-by: Linus Walleij <linus.walleij@linaro.org>
Acked-by: Russell King <linux@arm.linux.org.uk>
Acked-by: Stephen Warren <swarren@nvidia.com>
Tested-by: Andrew Lunn <andrew@lunn.ch>
Tested-by: Nicolas Pitre <nicolas.pitre@linaro.org>
Reviewed-by: Lan Tianyu <tianyu.lan@intel.com>
Signed-off-by: Viresh Kumar <viresh.kumar@linaro.org>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
We have per-CPU cpu_policy_rwsem for cpufreq core, but we never use
all of them. We always use rwsem of policy->cpu and so we can
actually make this rwsem per policy instead.
This patch does this change. With this change other tricky situations
are also avoided now, like which lock to take while we are changing
policy->cpu, etc.
Suggested-by: Srivatsa S. Bhat <srivatsa.bhat@linux.vnet.ibm.com>
Signed-off-by: Viresh Kumar <viresh.kumar@linaro.org>
Reviewed-by: Srivatsa S. Bhat <srivatsa.bhat@linux.vnet.ibm.com>
Tested-by: Andrew Lunn <andrew@lunn.ch>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Currently, the prototype of cpufreq_drivers target routines is:
int target(struct cpufreq_policy *policy, unsigned int target_freq,
unsigned int relation);
And most of the drivers call cpufreq_frequency_table_target() to get a valid
index of their frequency table which is closest to the target_freq. And they
don't use target_freq and relation after that.
So, it makes sense to just do this work in cpufreq core before calling
cpufreq_frequency_table_target() and simply pass index instead. But this can be
done only with drivers which expose their frequency table with cpufreq core. For
others we need to stick with the old prototype of target() until those drivers
are converted to expose frequency tables.
This patch implements the new light weight prototype for target_index() routine.
It looks like this:
int target_index(struct cpufreq_policy *policy, unsigned int index);
CPUFreq core will call cpufreq_frequency_table_target() before calling this
routine and pass index to it. Because CPUFreq core now requires to call routines
present in freq_table.c CONFIG_CPU_FREQ_TABLE must be enabled all the time.
This also marks target() interface as deprecated. So, that new drivers avoid
using it. And Documentation is updated accordingly.
It also converts existing .target() to newly defined light weight
.target_index() routine for many driver.
Acked-by: Hans-Christian Egtvedt <egtvedt@samfundet.no>
Acked-by: Jesper Nilsson <jesper.nilsson@axis.com>
Acked-by: Linus Walleij <linus.walleij@linaro.org>
Acked-by: Russell King <linux@arm.linux.org.uk>
Acked-by: David S. Miller <davem@davemloft.net>
Tested-by: Andrew Lunn <andrew@lunn.ch>
Signed-off-by: Viresh Kumar <viresh.kumar@linaro.org>
Signed-off-by: Rafael J. Wysocki <rjw@rjwysocki.net>
The function update_policy_cpu() is expected to be called when the policy->cpu
of a cpufreq policy is to be changed: ie., the new CPU nominated to become the
policy->cpu is different from the old one.
Print a warning if it is invoked with new_cpu == old_cpu, since such an
invocation might hint at a faulty logic in the caller.
Suggested-by: Viresh Kumar <viresh.kumar@linaro.org>
Signed-off-by: Srivatsa S. Bhat <srivatsa.bhat@linux.vnet.ibm.com>
Acked-by: Viresh Kumar <viresh.kumar@linaro.org>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
CONFIG_CPU_FREQ_TABLE will be always enabled when cpufreq framework is used, as
cpufreq core depends on it. So, we don't need this CONFIG option anymore as it
is not configurable. Remove CONFIG_CPU_FREQ_TABLE and update its users.
Signed-off-by: Viresh Kumar <viresh.kumar@linaro.org>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Many CPUFreq drivers for SMP system (where all cores share same clock lines), do
similar stuff in their ->init() part.
This patch creates a generic routine in cpufreq core which can be used by these
so that we can remove some redundant code.
Signed-off-by: Viresh Kumar <viresh.kumar@linaro.org>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Almost all drivers set policy->cur with current CPU frequency in their ->init()
part. This can be done for all of them at core level and so they wouldn't need
to do it.
This patch adds supporting code in cpufreq core for calling get() after we have
called init() for a policy.
Signed-off-by: Viresh Kumar <viresh.kumar@linaro.org>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Use cpufreq_driver->flags to mark CPUFREQ_HAVE_GOVERNOR_PER_POLICY instead
of a separate field within cpufreq_driver. This will save some bytes of
memory.
Signed-off-by: Viresh Kumar <viresh.kumar@linaro.org>
Reviewed-by: Srivatsa S. Bhat <srivatsa.bhat@linux.vnet.ibm.com>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Earlier there used to be two functions named __cpufreq_set_policy() and
cpufreq_set_policy(), but now we only have a single routine lets name it
cpufreq_set_policy() instead of __cpufreq_set_policy().
This also removes some invalid comments or fixes some incorrect comments.
Signed-off-by: Viresh Kumar <viresh.kumar@linaro.org>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Nobody except cpufreq_remove_dev() calls __cpufreq_remove_dev() and
so we don't need two separate routines here. Merge code from
__cpufreq_remove_dev() into cpufreq_remove_dev() and get rid of
__cpufreq_remove_dev().
Signed-off-by: Viresh Kumar <viresh.kumar@linaro.org>
Reviewed-by: Srivatsa S. Bhat <srivatsa.bhat@linux.vnet.ibm.com>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
As a rule its better not to break string (quoted inside "") in a
print statement even if it crosses 80 column boundary as that may
introduce bugs and so this patch rewrites one of the print statements..
Signed-off-by: Viresh Kumar <viresh.kumar@linaro.org>
Reviewed-by: Srivatsa S. Bhat <srivatsa.bhat@linux.vnet.ibm.com>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
We don't need a blank line just at start of a block, lets remove it.
Signed-off-by: Viresh Kumar <viresh.kumar@linaro.org>
Reviewed-by: Srivatsa S. Bhat <srivatsa.bhat@linux.vnet.ibm.com>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Some section of kerneldoc comment for __cpufreq_remove_dev() is invalid now.
Remove it.
Suggested-by: Srivatsa S. Bhat <srivatsa.bhat@linux.vnet.ibm.com>
Signed-off-by: Viresh Kumar <viresh.kumar@linaro.org>
Reviewed-by: Srivatsa S. Bhat <srivatsa.bhat@linux.vnet.ibm.com>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
lock_policy_rwsem_{read|write}() currently has return type of int,
but it always returns zero and hence its return type should be void
instead. This patch makes that change and modifies all of the users
accordingly.
Reported-by: Jon Medhurst<tixy@linaro.org>
Signed-off-by: Viresh Kumar <viresh.kumar@linaro.org>
Reviewed-by: Srivatsa S. Bhat <srivatsa.bhat@linux.vnet.ibm.com>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
cpufreq_get() can be called from external drivers which might not be aware if
cpufreq driver is registered or not. And so we should actually check if cpufreq
driver is registered or not and also if cpufreq is active or disabled, at the
beginning of cpufreq_get().
Otherwise call to lock_policy_rwsem_read() might hit BUG_ON(!policy).
Reported-and-tested-by: Linus Walleij <linus.walleij@linaro.org>
Signed-off-by: Viresh Kumar <viresh.kumar@linaro.org>
Reviewed-by: Srivatsa S. Bhat <srivatsa.bhat@linux.vnet.ibm.com>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
On systems that support intel_pstate, acpi_cpufreq fails to load, and
udev keeps trying until trace gets filled up and kernel crashes.
The root cause is driver return ret from cpufreq_register_driver(),
because when some other driver takes over before, it will return
EBUSY and then udev will keep trying ...
cpufreq_register_driver() should return EEXIST instead so that the
system can boot without appending intel_pstate=disable and still use
intel_pstate.
Signed-off-by: Yinghai Lu <yinghai@kernel.org>
Acked-by: Viresh Kumar <viresh.kumar@linaro.org>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Current code looks like this:
WARN_ON(lock_policy_rwsem_write(cpu));
update_policy_cpu(policy, new_cpu);
unlock_policy_rwsem_write(cpu);
{lock|unlock}_policy_rwsem_write(cpu) takes/releases policy->cpu's rwsem.
Because cpu is changing with the call to update_policy_cpu(), the
unlock_policy_rwsem_write() will release the incorrect lock.
The right solution would be to release the same lock as was taken earlier. Also
update_policy_cpu() was also called from cpufreq_add_dev() without any locks and
so its better if we move this locking to inside update_policy_cpu().
This patch fixes a regression introduced in 3.12 by commit f9ba680d23
(cpufreq: Extract the handover of policy cpu to a helper function).
Reported-and-tested-by: Jon Medhurst<tixy@linaro.org>
Signed-off-by: Viresh Kumar <viresh.kumar@linaro.org>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
This broke after a recent change "cedb70a cpufreq: Split __cpufreq_remove_dev()
into two parts" from Srivatsa.
Consider a scenario where we have two CPUs in a policy (0 & 1) and we are
removing CPU 1. On the call to __cpufreq_remove_dev_prepare() we have cleared 1
from policy->cpus and now on a call to __cpufreq_remove_dev_finish() we read
cpumask_weight of policy->cpus, which will come as 1 and this code will behave
as if we are removing the last CPU from policy :)
Fix it by clearing the CPU mask in __cpufreq_remove_dev_finish() instead of
__cpufreq_remove_dev_prepare().
Tested-by: Stephen Warren <swarren@wwwdotorg.org>
Reviewed-by: Srivatsa S. Bhat <srivatsa.bhat@linux.vnet.ibm.com>
Signed-off-by: Viresh Kumar <viresh.kumar@linaro.org>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
In cpufreq_policy_restore() before system suspend policy is read from
percpu's cpufreq_cpu_data_fallback. It's a read operation rather
than a write one, so take the lock for reading in there.
Signed-off-by: Lan Tianyu <tianyu.lan@intel.com>
Reviewed-by: Srivatsa S. Bhat <srivatsa.bhat@linux.vnet.ibm.com>
Acked-by: Viresh Kumar <viresh.kumar@linaro.org>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
If update_policy_cpu() is invoked with the existing policy->cpu itself
as the new-cpu parameter, then a lot of things can go terribly wrong.
In its present form, update_policy_cpu() always assumes that the new-cpu
is different from policy->cpu and invokes other functions to perform their
respective updates. And those functions implement the actual update like
this:
per_cpu(..., new_cpu) = per_cpu(..., last_cpu);
per_cpu(..., last_cpu) = NULL;
Thus, when new_cpu == last_cpu, the final NULL assignment makes the per-cpu
references vanish into thin air! (memory leak). From there, it leads to more
problems: cpufreq_stats_create_table() now doesn't find the per-cpu reference
and hence tries to create a new sysfs-group; but sysfs already had created
the group earlier, so it complains that it cannot create a duplicate filename.
In short, the repercussions of a rather innocuous invocation of
update_policy_cpu() can turn out to be pretty nasty.
Ideally update_policy_cpu() should handle this situation (new == last)
gracefully, and not lead to such severe problems. So fix it by adding an
appropriate check.
Signed-off-by: Srivatsa S. Bhat <srivatsa.bhat@linux.vnet.ibm.com>
Tested-by: Stephen Warren <swarren@nvidia.com>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
In __cpufreq_remove_dev_prepare(), the code which decides whether to remove
the sysfs link or nominate a new policy cpu, is governed by an if/else block
with a rather complex set of conditionals. Worse, they harbor a subtlety
which leads to certain unintended behavior.
The code looks like this:
if (cpu != policy->cpu && !frozen) {
sysfs_remove_link(&dev->kobj, "cpufreq");
} else if (cpus > 1) {
new_cpu = cpufreq_nominate_new_policy_cpu(...);
...
update_policy_cpu(..., new_cpu);
}
The original intention was:
If the CPU going offline is not policy->cpu, just remove the link.
On the other hand, if the CPU going offline is the policy->cpu itself,
handover the policy->cpu job to some other surviving CPU in that policy.
But because the 'if' condition also includes the 'frozen' check, now there
are *two* possibilities by which we can enter the 'else' block:
1. cpu == policy->cpu (intended)
2. cpu != policy->cpu && frozen (unintended)
Due to the second (unintended) scenario, we end up spuriously nominating
a CPU as the policy->cpu, even when the existing policy->cpu is alive and
well. This can cause problems further down the line, especially when we end
up nominating the same policy->cpu as the new one (ie., old == new),
because it totally confuses update_policy_cpu().
To avoid this mess, restructure the if/else block to only do what was
originally intended, and thus prevent any unwelcome surprises.
Signed-off-by: Srivatsa S. Bhat <srivatsa.bhat@linux.vnet.ibm.com>
Tested-by: Stephen Warren <swarren@nvidia.com>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Stephen Warren reported that the cpufreq-stats code hits a NULL pointer
dereference during the second attempt to suspend a system. He also
pin-pointed the problem to commit 5302c3f "cpufreq: Perform light-weight
init/teardown during suspend/resume".
That commit actually ensured that the cpufreq-stats table and the
cpufreq-stats sysfs entries are *not* torn down (ie., not freed) during
suspend/resume, which makes it all the more surprising. However, it turns
out that the root-cause is not that we access an already freed memory, but
that the reference to the allocated memory gets moved around and we lose
track of that during resume, leading to the reported crash in a subsequent
suspend attempt.
In the suspend path, during CPU offline, the value of policy->cpu is updated
by choosing one of the surviving CPUs in that policy, as long as there is
atleast one CPU in that policy. And cpufreq_stats_update_policy_cpu() is
invoked to update the reference to the stats structure by assigning it to
the new CPU. However, in the resume path, during CPU online, we end up
assigning a fresh CPU as the policy->cpu, without letting cpufreq-stats
know about this. Thus the reference to the stats structure remains
(incorrectly) associated with the old CPU. So, in a subsequent suspend attempt,
during CPU offline, we end up accessing an incorrect location to get the
stats structure, which eventually leads to the NULL pointer dereference.
Fix this by letting cpufreq-stats know about the update of the policy->cpu
during CPU online in the resume path. (Also, move the update_policy_cpu()
function higher up in the file, so that __cpufreq_add_dev() can invoke
it).
Reported-and-tested-by: Stephen Warren <swarren@nvidia.com>
Signed-off-by: Srivatsa S. Bhat <srivatsa.bhat@linux.vnet.ibm.com>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Commit 7c30ed5 (cpufreq: make sure frequency transitions are
serialized) attempted to serialize frequency transitions by
adding checks to the CPUFREQ_PRECHANGE and CPUFREQ_POSTCHANGE
notifications. However, it assumed that the notifications will
always originate from the driver's .target() callback, but they
also can be triggered by cpufreq_out_of_sync() and that leads to
warnings like this on some systems:
WARNING: CPU: 0 PID: 14543 at drivers/cpufreq/cpufreq.c:317
__cpufreq_notify_transition+0x238/0x260()
In middle of another frequency transition
accompanied by a call trace similar to this one:
[<ffffffff81720daa>] dump_stack+0x46/0x58
[<ffffffff8106534c>] warn_slowpath_common+0x8c/0xc0
[<ffffffff815b8560>] ? acpi_cpufreq_target+0x320/0x320
[<ffffffff81065436>] warn_slowpath_fmt+0x46/0x50
[<ffffffff815b1ec8>] __cpufreq_notify_transition+0x238/0x260
[<ffffffff815b33be>] cpufreq_notify_transition+0x3e/0x70
[<ffffffff815b345d>] cpufreq_out_of_sync+0x6d/0xb0
[<ffffffff815b370c>] cpufreq_update_policy+0x10c/0x160
[<ffffffff815b3760>] ? cpufreq_update_policy+0x160/0x160
[<ffffffff81413813>] cpufreq_set_cur_state+0x8c/0xb5
[<ffffffff814138df>] processor_set_cur_state+0xa3/0xcf
[<ffffffff8158e13c>] thermal_cdev_update+0x9c/0xb0
[<ffffffff8159046a>] step_wise_throttle+0x5a/0x90
[<ffffffff8158e21f>] handle_thermal_trip+0x4f/0x140
[<ffffffff8158e377>] thermal_zone_device_update+0x57/0xa0
[<ffffffff81415b36>] acpi_thermal_check+0x2e/0x30
[<ffffffff81415ca0>] acpi_thermal_notify+0x40/0xdc
[<ffffffff813e7dbd>] acpi_device_notify+0x19/0x1b
[<ffffffff813f8241>] acpi_ev_notify_dispatch+0x41/0x5c
[<ffffffff813e3fbe>] acpi_os_execute_deferred+0x25/0x32
[<ffffffff81081060>] process_one_work+0x170/0x4a0
[<ffffffff81082121>] worker_thread+0x121/0x390
[<ffffffff81082000>] ? manage_workers.isra.20+0x170/0x170
[<ffffffff81088fe0>] kthread+0xc0/0xd0
[<ffffffff81088f20>] ? flush_kthread_worker+0xb0/0xb0
[<ffffffff8173582c>] ret_from_fork+0x7c/0xb0
[<ffffffff81088f20>] ? flush_kthread_worker+0xb0/0xb0
For this reason, revert commit 7c30ed5 along with the fix 266c13d
(cpufreq: Fix serialization of frequency transitions) on top of it
and we will revisit the serialization problem later.
Reported-by: Alessandro Bono <alessandro.bono@gmail.com>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
There are places where the variable 'ret' is declared as unsigned int
and then used to store negative return values such as -EINVAL. Fix them
by declaring the variable as a signed quantity.
Signed-off-by: Srivatsa S. Bhat <srivatsa.bhat@linux.vnet.ibm.com>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Commit "cpufreq: serialize calls to __cpufreq_governor()" had been a temporary
and partial solution to the race condition between writing to a cpufreq sysfs
file and taking a CPU offline. Now that we have a proper and complete solution
to that problem, remove the temporary fix.
Signed-off-by: Srivatsa S. Bhat <srivatsa.bhat@linux.vnet.ibm.com>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
The functions that are used to write to cpufreq sysfs files (such as
store_scaling_max_freq()) are not hotplug safe. They can race with CPU
hotplug tasks and lead to problems such as trying to acquire an already
destroyed timer-mutex etc.
Eg:
__cpufreq_remove_dev()
__cpufreq_governor(policy, CPUFREQ_GOV_STOP);
policy->governor->governor(policy, CPUFREQ_GOV_STOP);
cpufreq_governor_dbs()
case CPUFREQ_GOV_STOP:
mutex_destroy(&cpu_cdbs->timer_mutex)
cpu_cdbs->cur_policy = NULL;
<PREEMPT>
store()
__cpufreq_set_policy()
__cpufreq_governor(policy, CPUFREQ_GOV_LIMITS);
policy->governor->governor(policy, CPUFREQ_GOV_LIMITS);
case CPUFREQ_GOV_LIMITS:
mutex_lock(&cpu_cdbs->timer_mutex); <-- Warning (destroyed mutex)
if (policy->max < cpu_cdbs->cur_policy->cur) <- cur_policy == NULL
So use get_online_cpus()/put_online_cpus() in the store_*() functions, to
synchronize with CPU hotplug. However, there is an additional point to note
here: some parts of the CPU teardown in the cpufreq subsystem are done in
the CPU_POST_DEAD stage, with cpu_hotplug.lock *released*. So, using the
get/put_online_cpus() functions alone is insufficient; we should also ensure
that we don't race with those latter steps in the hotplug sequence. We can
easily achieve this by checking if the CPU is online before proceeding with
the store, since the CPU would have been marked offline by the time the
CPU_POST_DEAD notifiers are executed.
Reported-by: Stephen Boyd <sboyd@codeaurora.org>
Signed-off-by: Srivatsa S. Bhat <srivatsa.bhat@linux.vnet.ibm.com>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
__cpufreq_remove_dev_finish() handles the kobject cleanup for a CPU going
offline. But because we destroy the kobject towards the end of the CPU offline
phase, there are certain race windows where a task can try to write to a
cpufreq sysfs file (eg: using store_scaling_max_freq()) while we are taking
that CPU offline, and this can bump up the kobject refcount, which in turn might
hinder the CPU offline task from running to completion. (It can also cause
other more serious problems such as trying to acquire a destroyed timer-mutex
etc., depending on the exact stage of the cleanup at which the task managed to
take a new refcount).
To fix the race window, we will need to synchronize those store_*() call-sites
with CPU hotplug, using get_online_cpus()/put_online_cpus(). However, that
in turn can cause a total deadlock because it can end up waiting for the
CPU offline task to complete, with incremented refcount!
Write to sysfs CPU offline task
-------------- ----------------
kobj_refcnt++
Acquire cpu_hotplug.lock
get_online_cpus();
Wait for kobj_refcnt to drop to zero
**DEADLOCK**
A simple way to avoid this problem is to perform the kobject cleanup in the
CPU offline path, with the cpu_hotplug.lock *released*. That is, we can
perform the wait-for-kobj-refcnt-to-drop as well as the subsequent cleanup
in the CPU_POST_DEAD stage of CPU offline, which is run with cpu_hotplug.lock
released. Doing this helps us avoid deadlocks due to holding kobject refcounts
and waiting on each other on the cpu_hotplug.lock.
(Note: We can't move all of the cpufreq CPU offline steps to the
CPU_POST_DEAD stage, because certain things such as stopping the governors
have to be done before the outgoing CPU is marked offline. So retain those
parts in the CPU_DOWN_PREPARE stage itself).
Reported-by: Stephen Boyd <sboyd@codeaurora.org>
Signed-off-by: Srivatsa S. Bhat <srivatsa.bhat@linux.vnet.ibm.com>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
During CPU offline, the cpufreq core invokes __cpufreq_remove_dev()
to perform work such as stopping the cpufreq governor, clearing the
CPU from the policy structure etc, and finally cleaning up the
kobject.
There are certain subtle issues related to the kobject cleanup, and
it would be much easier to deal with them if we separate that part
from the rest of the cleanup-work in the CPU offline phase. So split
the __cpufreq_remove_dev() function into 2 parts: one that handles
the kobject cleanup, and the other that handles the rest of the work.
Reported-by: Stephen Boyd <sboyd@codeaurora.org>
Signed-off-by: Srivatsa S. Bhat <srivatsa.bhat@linux.vnet.ibm.com>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
We can't take a big lock around __cpufreq_governor() as this causes
recursive locking for some cases. But calls to this routine must be
serialized for every policy. Otherwise we can see some unpredictable
events.
For example, consider following scenario:
__cpufreq_remove_dev()
__cpufreq_governor(policy, CPUFREQ_GOV_STOP);
policy->governor->governor(policy, CPUFREQ_GOV_STOP);
cpufreq_governor_dbs()
case CPUFREQ_GOV_STOP:
mutex_destroy(&cpu_cdbs->timer_mutex)
cpu_cdbs->cur_policy = NULL;
<PREEMPT>
store()
__cpufreq_set_policy()
__cpufreq_governor(policy, CPUFREQ_GOV_LIMITS);
policy->governor->governor(policy, CPUFREQ_GOV_LIMITS);
case CPUFREQ_GOV_LIMITS:
mutex_lock(&cpu_cdbs->timer_mutex); <-- Warning (destroyed mutex)
if (policy->max < cpu_cdbs->cur_policy->cur) <- cur_policy == NULL
And so store() will eventually result in a crash if cur_policy is
NULL at this point.
Introduce an additional variable which would guarantee serialization
here.
Reported-by: Stephen Boyd <sboyd@codeaurora.org>
Signed-off-by: Viresh Kumar <viresh.kumar@linaro.org>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
__cpufreq_governor() returns with -EBUSY when governor is already
stopped and we try to stop it again, but when it is stopped we must
not allow calls to CPUFREQ_GOV_LIMITS event as well.
This patch adds this check in __cpufreq_governor().
Reported-by: Stephen Boyd <sboyd@codeaurora.org>
Signed-off-by: Viresh Kumar <viresh.kumar@linaro.org>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
To iterate over all policies we currently iterate over all online
CPUs and then get the policy for each of them which is suboptimal.
Use the newly created cpufreq_policy_list for this purpose instead.
Signed-off-by: Viresh Kumar <viresh.kumar@linaro.org>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
cpufreq_policy_cpu per-cpu variables are used for storing the ID of
the CPU that manages the given CPU's policy. However, we also store
a policy pointer for each cpu in cpufreq_cpu_data, so the
cpufreq_policy_cpu information is simply redundant.
It is better to use cpufreq_cpu_data to retrieve a policy and get
policy->cpu from there, so make that happen everywhere and drop the
cpufreq_policy_cpu per-cpu variables which aren't necessary any more.
[rjw: Changelog]
Signed-off-by: Viresh Kumar <viresh.kumar@linaro.org>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
We don't need to check if event is CPUFREQ_GOV_POLICY_INIT and put
governor module as we are sure event can only be START/STOP here.
Remove the useless check.
Signed-off-by: Viresh Kumar <viresh.kumar@linaro.org>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
cpufreq_policy_list is a list of active policies. We do remove
policies from this list when all CPUs belonging to that policy are
removed. But during system suspend we don't really free a policy
struct as it will be used again during resume, so we didn't remove
it from cpufreq_policy_list as well..
However, this is incorrect. We are saying this policy isn't valid
anymore and must not be referenced (though we haven't freed it), but
it can still be used by code that iterates over cpufreq_policy_list.
Remove policy from this list during system suspend as well.
Of course, we must add it back whenever the first CPU belonging to
that policy shows up.
[rjw: Changelog]
Signed-off-by: Viresh Kumar <viresh.kumar@linaro.org>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Align closing brace '}' of an if block.
[rjw: Subject and changelog]
Signed-off-by: Viresh Kumar <viresh.kumar@linaro.org>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Revert commit eb60852 (cpufreq: Use cpufreq_policy_list for iterating
over policies), because it breaks system suspend/resume on multiple
machines.
It either causes resume to block indefinitely or causes the BUG_ON()
in lock_policy_rwsem_##mode() to trigger on sysfs accesses to cpufreq
attributes.
Conflicts:
drivers/cpufreq/cpufreq.c
The __cpufreq_governor() function can fail in rare cases especially
if there are bugs in cpufreq drivers. Thus we must stop processing
as soon as this routine fails, otherwise it may result in undefined
behavior.
This patch adds error checking code whenever this routine is called
from any place.
Signed-off-by: Viresh Kumar <viresh.kumar@linaro.org>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Critical sections of the cpufreq core are protected with the help of
the driver module owner's refcount, which isn't the correct approach,
because it causes rmmod to return an error when some routine has
updated that refcount.
Let's use rwsem for this purpose instead. Only
cpufreq_unregister_driver() will use write sem
and everybody else will use read sem.
[rjw: Subject & changelog]
Signed-off-by: Viresh Kumar <viresh.kumar@linaro.org>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
The cpufreq governor owner refcount usage is broken. We should only
increment that refcount when a CPUFREQ_GOV_POLICY_INIT event has come
and it should only be decremented if CPUFREQ_GOV_POLICY_EXIT has come.
Currently, there can be situations where the governor is in use, but
we have allowed it to be unloaded which may result in undefined
behavior. Let's fix it.
Signed-off-by: Viresh Kumar <viresh.kumar@linaro.org>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
To iterate over all policies we currently iterate over all CPUs and
then get the policy for each of them. Let's use the newly created
cpufreq_policy_list for this purpose.
[rjw: Changelog]
Signed-off-by: Viresh Kumar <viresh.kumar@linaro.org>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Policies available in the cpufreq framework are now linked together.
They are accessible via cpufreq_policy_list defined in the cpufreq
core.
[rjw: Fix from Yinghai Lu folded in]
Signed-off-by: Lukasz Majewski <l.majewski@samsung.com>
Signed-off-by: Myungjoo Ham <myungjoo.ham@samsung.com>
Signed-off-by: Viresh Kumar <viresh.kumar@linaro.org>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>