When the kernel is running in secure boot mode, we lock down the kernel to
prevent userspace from modifying the running kernel image. Whilst this
includes prohibiting access to things like /dev/mem, it must also prevent
access by means of configuring driver modules in such a way as to cause a
device to access or modify the kernel image.
To this end, annotate module_param* statements that refer to hardware
configuration and indicate for future reference what type of parameter they
specify. The parameter parser in the core sees this information and can
skip such parameters with an error message if the kernel is locked down.
The module initialisation then runs as normal, but just sees whatever the
default values for those parameters are.
Note that we do still need to do the module initialisation because some
drivers have viable defaults set in case parameters aren't specified and
some drivers support automatic configuration (e.g. PNP or PCI) in addition
to manually coded parameters.
This patch annotates drivers in drivers/cpufreq/.
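For reference, the annotated form looks roughly like the sketch below; the
module_param_hw() helper and the hardware resource types (ioport, irq, ...)
are the ones this series introduces, and the parameters shown are made up
rather than taken from a specific cpufreq driver.

#include <linux/moduleparam.h>

static int ioport;	/* I/O port base supplied by the user */
static int irq;		/* interrupt line supplied by the user */

/*
 * Previously: module_param(ioport, int, 0444); module_param(irq, int, 0444);
 * The annotated form names the kind of hardware resource the parameter
 * describes, so a locked-down kernel can reject it (keeping the default)
 * while everything else behaves exactly as before.
 */
module_param_hw(ioport, int, ioport, 0444);
module_param_hw(irq, int, irq, 0444);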
Suggested-by: Alan Cox <gnomes@lxorguk.ukuu.org.uk>
Signed-off-by: David Howells <dhowells@redhat.com>
Acked-by: "Rafael J. Wysocki" <rjw@rjwysocki.net>
Acked-by: Viresh Kumar <viresh.kumar@linaro.org>
cc: linux-pm@vger.kernel.org
Add a new cpufreq driver for Tegra186 (and likely later).
The CPUs are organized into two clusters, Denver and A57,
with two and four cores respectively. CPU frequency can be
adjusted by writing the desired rate divisor and a voltage
hint to a special per-core register.
The frequency of each core can be set individually; however,
this is just a hint as all CPUs in a cluster will run at
the maximum rate of non-idle CPUs in the cluster.
Signed-off-by: Mikko Perttunen <mperttunen@nvidia.com>
Acked-by: Viresh Kumar <viresh.kumar@linaro.org>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Judging by the preceding error handling code, 'goto out_free_opp' is likely
what is expected here, in order to avoid a memory leak in the
error handling path.
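For reference, the unwind pattern being restored is sketched below; the
function and label names (other than the OPP table helpers) are hypothetical
and only illustrate the shape of the fix.

#include <linux/cpumask.h>
#include <linux/platform_device.h>
#include <linux/pm_opp.h>

static int foo_register_driver(struct platform_device *pdev);	/* hypothetical */

static int foo_cpufreq_probe(struct platform_device *pdev)
{
	int ret;

	ret = dev_pm_opp_of_cpumask_add_table(cpu_possible_mask);
	if (ret)
		return ret;

	ret = foo_register_driver(pdev);
	if (ret)
		goto out_free_opp;	/* unwind instead of leaking the OPP table */

	return 0;

out_free_opp:
	dev_pm_opp_of_cpumask_remove_table(cpu_possible_mask);
	return ret;
}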
Signed-off-by: Christophe JAILLET <christophe.jaillet@wanadoo.fr>
Acked-by: Viresh Kumar <viresh.kumar@linaro.org>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
If the cpufreq driver tries to modify the voltage/frequency during
suspend/resume, it might need to control an external PMIC via I2C or SPI,
but those devices might already be suspended. This issue is likely to happen
whenever the LDOs have their vin-supply set.
To avoid this scenario, just raise the cpufreq to the maximum before
suspend.
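A sketch of how a cpufreq driver can express that, assuming it uses the
generic suspend helper (cpufreq_generic_suspend() switches the policy to
policy->suspend_freq); the driver names are illustrative.

#include <linux/cpufreq.h>

static int foo_cpufreq_init(struct cpufreq_policy *policy)
{
	/* ... clock/OPP/frequency table setup elided ... */

	/*
	 * Go to the highest frequency before suspend so no PMIC traffic
	 * over I2C/SPI is needed while those buses may already be asleep.
	 */
	policy->suspend_freq = policy->cpuinfo.max_freq;
	return 0;
}

static struct cpufreq_driver foo_cpufreq_driver = {
	.init		= foo_cpufreq_init,
	.suspend	= cpufreq_generic_suspend,
	/* ... verify/target/get callbacks elided ... */
};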
Signed-off-by: Leonard Crestez <leonard.crestez@nxp.com>
Reviewed-by: Lucas Stach <l.stach@pengutronix.de>
Acked-by: Viresh Kumar <viresh.kumar@linaro.org>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
If there are any errors in getting the cpu0 regulators, the driver returns
-ENOENT. In case the regulators are not yet available, the devm_regulator_get
calls will return -EPROBE_DEFER, so that the driver can be probed later.
If we return -ENOENT, the driver will fail its initialization and will
not try to probe again (when the regulators become available).
Return the actual error received from regulator_get in probe. Print
different messages for the case where we need to probe the device later
and for an actual failure. Also add a message to report when the
driver has been successfully registered.
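The resulting probe-time handling looks roughly like the sketch below; the
"arm" supply name and the messages are illustrative rather than the driver's
exact code.

#include <linux/err.h>
#include <linux/platform_device.h>
#include <linux/regulator/consumer.h>

static int foo_cpufreq_probe(struct platform_device *pdev)
{
	struct regulator *arm_reg;
	int ret;

	arm_reg = devm_regulator_get(&pdev->dev, "arm");
	if (IS_ERR(arm_reg)) {
		ret = PTR_ERR(arm_reg);
		if (ret == -EPROBE_DEFER)
			dev_dbg(&pdev->dev, "regulators not ready, deferring probe\n");
		else
			dev_err(&pdev->dev, "failed to get regulators: %d\n", ret);
		return ret;	/* propagate the real error, not -ENOENT */
	}

	/* ... clock setup and cpufreq driver registration elided ... */

	dev_info(&pdev->dev, "cpufreq driver registered\n");
	return 0;
}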
Signed-off-by: Irina Tirdea <irina.tirdea@nxp.com>
Signed-off-by: Leonard Crestez <leonard.crestez@nxp.com>
Reviewed-by: Lucas Stach <l.stach@pengutronix.de>
Acked-by: Viresh Kumar <viresh.kumar@linaro.org>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Make the schedutil governor take the initial (default) value of the
rate_limit_us sysfs attribute from the (new) transition_delay_us
policy parameter (to be set by the scaling driver).
That will allow scaling drivers to make schedutil use smaller default
values of rate_limit_us and reduce the default average time interval
between consecutive frequency changes.
Make intel_pstate set transition_delay_us to 500.
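In a scaling driver this amounts to a one-line assignment in its ->init
callback; a minimal sketch (500 us being the value intel_pstate uses, per
the above):

static int foo_cpufreq_cpu_init(struct cpufreq_policy *policy)
{
	/* ... normal per-policy setup elided ... */

	/*
	 * Make schedutil default rate_limit_us to 500 us instead of
	 * deriving it from the transition latency.
	 */
	policy->transition_delay_us = 500;
	return 0;
}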
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Acked-by: Viresh Kumar <viresh.kumar@linaro.org>
The access to the HBIRD_ESTAR_MODE register in the cpu frequency control
functions must happen on the target CPU. This is achieved by temporarily
setting the affinity of the calling user space thread to the requested CPU
and resetting it to the original affinity afterwards.
That's racy vs. CPU hotplug and concurrent affinity settings for that
thread, and it results in code executing on the wrong CPU and in the new
affinity setting being overwritten.
Replace it with a straightforward SMP function call.
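The shape of the replacement is roughly as follows; foo_write_estar_mode()
stands in for the driver's actual register write and is hypothetical.

#include <linux/cpufreq.h>
#include <linux/smp.h>

static int foo_write_estar_mode(unsigned int index);	/* hypothetical */

struct freq_request {
	unsigned int index;	/* frequency table index to program */
	int ret;
};

/* Executed on the target CPU, so the per-CPU register access is valid. */
static void foo_set_rate_on_cpu(void *arg)
{
	struct freq_request *req = arg;

	req->ret = foo_write_estar_mode(req->index);
}

static int foo_target_index(struct cpufreq_policy *policy, unsigned int index)
{
	struct freq_request req = { .index = index };

	/* Run the helper on policy->cpu and wait for it; no affinity games. */
	smp_call_function_single(policy->cpu, foo_set_rate_on_cpu, &req, 1);
	return req.ret;
}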
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Acked-by: Viresh Kumar <viresh.kumar@linaro.org>
Cc: Fenghua Yu <fenghua.yu@intel.com>
Cc: Tony Luck <tony.luck@intel.com>
Cc: Herbert Xu <herbert@gondor.apana.org.au>
Cc: "Rafael J. Wysocki" <rjw@rjwysocki.net>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Sebastian Siewior <bigeasy@linutronix.de>
Cc: linux-pm@vger.kernel.org
Cc: Lai Jiangshan <jiangshanlai@gmail.com>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: Tejun Heo <tj@kernel.org>
Cc: "David S. Miller" <davem@davemloft.net>
Cc: Len Brown <lenb@kernel.org>
Link: http://lkml.kernel.org/r/alpine.DEB.2.20.1704131020280.2408@nanos
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
The access to the safari config register in the CPU frequency functions
must be executed on the target CPU. This is achieved by temporarily setting
the affinity of the calling user space thread to the requested CPU and
resetting it to the original affinity afterwards.
That's racy vs. CPU hotplug and concurrent affinity settings for that
thread, and it results in code executing on the wrong CPU and in the new
affinity setting being overwritten.
Replace it with a straightforward SMP function call.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Acked-by: Viresh Kumar <viresh.kumar@linaro.org>
Cc: Fenghua Yu <fenghua.yu@intel.com>
Cc: Tony Luck <tony.luck@intel.com>
Cc: Herbert Xu <herbert@gondor.apana.org.au>
Cc: "Rafael J. Wysocki" <rjw@rjwysocki.net>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Sebastian Siewior <bigeasy@linutronix.de>
Cc: linux-pm@vger.kernel.org
Cc: Lai Jiangshan <jiangshanlai@gmail.com>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: Tejun Heo <tj@kernel.org>
Cc: "David S. Miller" <davem@davemloft.net>
Cc: Len Brown <lenb@kernel.org>
Link: http://lkml.kernel.org/r/20170412201043.047558840@linutronix.de
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
The target() callback must run on the affected CPU. This is achieved by
temporarily setting the affinity of the calling thread to the requested CPU
and resetting it to the original affinity afterwards.
That's racy vs. concurrent affinity settings for that thread and results in
code executing on the wrong CPU.
Replace it with work_on_cpu(). All call paths which invoke the callbacks are
already protected against CPU hotplug.
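The replacement pattern looks roughly like this; foo_program_hw() stands in
for the driver's actual (possibly sleeping) hardware access and is
hypothetical.

#include <linux/cpufreq.h>
#include <linux/workqueue.h>

static long foo_program_hw(struct cpufreq_policy *policy);	/* hypothetical */

/* Runs in process context on the chosen CPU and may sleep. */
static long foo_do_set_freq(void *arg)
{
	struct cpufreq_policy *policy = arg;

	return foo_program_hw(policy);
}

static int foo_target(struct cpufreq_policy *policy, unsigned int target_freq,
		      unsigned int relation)
{
	/*
	 * Callers already hold CPU hotplug protection, so policy->cpu stays
	 * put; the requested frequency would normally be packed into the
	 * argument as well.
	 */
	return work_on_cpu(policy->cpu, foo_do_set_freq, policy);
}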
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Acked-by: Viresh Kumar <viresh.kumar@linaro.org>
Cc: Fenghua Yu <fenghua.yu@intel.com>
Cc: Tony Luck <tony.luck@intel.com>
Cc: Herbert Xu <herbert@gondor.apana.org.au>
Cc: "Rafael J. Wysocki" <rjw@rjwysocki.net>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Sebastian Siewior <bigeasy@linutronix.de>
Cc: linux-pm@vger.kernel.org
Cc: Lai Jiangshan <jiangshanlai@gmail.com>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: Tejun Heo <tj@kernel.org>
Cc: "David S. Miller" <davem@davemloft.net>
Cc: Len Brown <lenb@kernel.org>
Link: http://lkml.kernel.org/r/20170412201042.958216363@linutronix.de
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
The get() and target() callbacks must run on the affected CPU. This is
achieved by temporarily setting the affinity of the calling thread to the
requested CPU and resetting it to the original affinity afterwards.
That's racy vs. concurrent affinity settings for that thread and results in
code executing on the wrong CPU and overwriting the new affinity setting.
Replace it with work_on_cpu(). All call paths which invoke the callbacks are
already protected against CPU hotplug.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Acked-by: Viresh Kumar <viresh.kumar@linaro.org>
Cc: Fenghua Yu <fenghua.yu@intel.com>
Cc: Tony Luck <tony.luck@intel.com>
Cc: Herbert Xu <herbert@gondor.apana.org.au>
Cc: "Rafael J. Wysocki" <rjw@rjwysocki.net>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Sebastian Siewior <bigeasy@linutronix.de>
Cc: linux-pm@vger.kernel.org
Cc: Lai Jiangshan <jiangshanlai@gmail.com>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: Tejun Heo <tj@kernel.org>
Cc: "David S. Miller" <davem@davemloft.net>
Cc: Len Brown <lenb@kernel.org>
Link: http://lkml.kernel.org/r/alpine.DEB.2.20.1704122231100.2548@nanos
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
There is a report that after commit 27622b061e ("cpufreq: Convert
to hotplug state machine"), the normal CPU offline/online cycle
fails on some platforms.
According to the ftrace result, this problem was triggered on
platforms using acpi-cpufreq as the default cpufreq driver,
and due to the lack of some ACPI frequency method (e.g. _PCT),
cpufreq_online() failed and returned a negative value, so the CPU
hotplug state machine rolled back the CPU online process. Actually,
from the user's perspective, the failure of cpufreq_online() should
not prevent that CPU from being brought up, although cpufreq might
not work on that CPU.
BTW, during system startup cpufreq_online() is not invoked via CPU
online but by the cpufreq device creation process, so the APs can be
brought up even though cpufreq_online() fails in that stage.
This patch ignores the return value of cpufreq_online/offline() and
lets the cpufreq framework deal with the failure. cpufreq_online()
itself will do a proper rollback in that case and if _PCT is missing,
the ACPI cpufreq driver will print a warning if the corresponding
debug options have been enabled.
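The hotplug callbacks then reduce to thin wrappers that always return zero,
roughly along these lines (a sketch of the idea):

static int cpuhp_cpufreq_online(unsigned int cpu)
{
	cpufreq_online(cpu);

	/*
	 * Never fail CPU bring-up here: if cpufreq_online() could not set
	 * things up (e.g. a missing _PCT), it rolls itself back and the
	 * CPU simply runs without cpufreq.
	 */
	return 0;
}

static int cpuhp_cpufreq_offline(unsigned int cpu)
{
	cpufreq_offline(cpu);
	return 0;
}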
Bugzilla: https://bugzilla.kernel.org/show_bug.cgi?id=194581
Fixes: 27622b061e ("cpufreq: Convert to hotplug state machine")
Reported-and-tested-by: Tomasz Maciej Nowak <tmn505@gmail.com>
Signed-off-by: Chen Yu <yu.c.chen@intel.com>
Acked-by: Viresh Kumar <viresh.kumar@linaro.org>
Cc: 4.9+ <stable@vger.kernel.org> # 4.9+
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
It is pure mystery to me why we need to be on a specific CPU while
looking up a value in an array.
My best shot at this is that before commit d4019f0a92 ("cpufreq: move
freq change notifications to cpufreq core") it was required to invoke
cpufreq_notify_transition() on a special CPU.
Since it looks like a waste, remove it.
Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Acked-by: Viresh Kumar <viresh.kumar@linaro.org>
Cc: Rafael J. Wysocki <rjw@rjwysocki.net>
Cc: tglx@linutronix.de
Cc: linux-pm@vger.kernel.org
Cc: linux-mips@linux-mips.org
Patchwork: https://patchwork.linux-mips.org/patch/15888/
Signed-off-by: Ralf Baechle <ralf@linux-mips.org>
Use same parameters as INTEL_FAM6_ATOM_GOLDMONT to enable
Gemini Lake.
Signed-off-by: Box, David E <david.e.box@intel.com>
Signed-off-by: Srinivas Pandruvada <srinivas.pandruvada@linux.intel.com>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Some computations in intel_pstate_get_min_max() are not necessary
and one of its two callers doesn't even use the full result.
First off, the fixed-point value of cpu->max_perf represents a
non-negative number between 0 and 1 inclusive and cpu->min_perf
cannot be greater than cpu->max_perf. It is not necessary to check
those conditions every time the numbers in question are used.
Moreover, since intel_pstate_max_within_limits() only needs the
upper boundary, it doesn't make sense to compute the lower one in
there and returning min and max from intel_pstate_get_min_max()
via pointers doesn't look particularly nice.
For the above reasons, drop intel_pstate_get_min_max(), add a helper
to get the base P-state for min/max computations and carry those
computations out directly in the previous callers of intel_pstate_get_min_max().
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
intel_pstate_hwp_set() is the only function walking policy->cpus
in intel_pstate. The rest of the code simply assumes one CPU per
policy, including the initialization code.
Therefore it doesn't make sense for intel_pstate_hwp_set() to
walk policy->cpus as it is guaranteed to have only one bit set
for policy->cpu.
For this reason, rearrange intel_pstate_hwp_set() to take the CPU
number as the argument and drop the loop over policy->cpus from it.
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Add a new function pid_in_use() to return the information on whether
or not the PID-based P-state selection algorithm is in use.
That allows a couple of complicated conditions in the code to be
reduced to simple checks against the new function's return value.
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
The cpu_defaults structure is redundant, because it only contains
one member of type struct pstate_funcs which can be used directly
instead of struct cpu_defaults.
For this reason, drop struct cpu_defaults, use struct pstate_funcs
directly instead of it where applicable and rename all of the
variables of that type accordingly.
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Move the definitions of the cpu_defaults structures after the
definitions of utilization update callback routines to avoid
extra declarations of the latter.
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Avoid using extra function pointers during P-state selection by
dropping the get_target_pstate member from struct pstate_funcs,
adding a new update_util callback to it (to be registered with
the CPU scheduler as the utilization update callback in the active
mode) and reworking the utilization update callback routines to
invoke specific P-state selection functions directly.
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Notice that some overhead in the utilization update callbacks
registered by intel_pstate in the active mode can be avoided if
those callbacks are tailored to specific configurations of the
driver. For example, the utilization update callback for the HWP
enabled case only needs to update the average CPU performance
periodically whereas the utilization update callback for the
PID-based algorithm does not need to take IO-wait boosting into
account and so on.
With that in mind, define three utilization update callbacks for
three different use cases: HWP enabled, the CPU-load-based "powersave"
P-state selection algorithm, and the PID-based "powersave" P-state
selection algorithm, and modify the driver initialization to
choose the callback matching the driver's current configuration.
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
One of the checks in intel_pstate_update_status() implicitly relies
on the information that there are only two struct cpufreq_driver
objects available, but it is better to do it directly against the
value it really is about (to make the code easier to follow if
nothing else).
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
The driver_registered variable in intel_pstate is used for checking
whether or not the driver has been registered, but intel_pstate_driver
can be used for that too (with the rule that the driver is not
registered as long as it is NULL).
That is a bit more straightforward and the code may be simplified
a bit this way, so modify the driver accordingly.
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
PID controller parameters only need to be initialized if the
get_target_pstate_use_performance() P-state selection routine
is going to be used. It is not necessary to initialize them
otherwise, so don't do that.
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
In the HWP enabled case pid_params.sample_rate_ns only needs to be
updated once, because it is global, so do that when setting hwp_active
instead of doing it during the initialization of every CPU.
Moreover, pid_params.sample_rate_ms is never used if HWP is enabled,
so do not update it at all then.
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
intel_pstate_busy_pid_reset() is the only caller of pid_reset(),
pid_p_gain_set(), pid_i_gain_set(), and pid_d_gain_set(). Moreover,
it passes constants as two parameters of pid_reset() and all of
the other routines above essentially contain the same code, so
fold all of them into the caller and drop unnecessary computations.
Introduce percent_fp() for converting integer values in percent
to fixed-point fractions and use it in the above code cleanup.
Finally, rename intel_pstate_busy_pid_reset() to
intel_pstate_pid_reset() as it also is used for the
initialization of PID parameters for every CPU and the
meaning of the "busy" part of the name is not particularly
clear.
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
There is only one caller of intel_pstate_reset_all_pid(), which is
pid_param_set() used in the debugfs interface only, and having that
code split does not make it particularly convenient to follow.
For this reason, move the body of intel_pstate_reset_all_pid() into
its caller and drop that function.
Also change the loop from for_each_online_cpu() (which is obviously
racy with respect to CPU offline/online) to for_each_possible_cpu(),
so that all PID parameters are reset for all CPUs regardless of their
online/offline status (to prevent, for example, a previously offline
CPU from going online with a stale set of PID parameters).
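After the change pid_param_set() does the reset itself, roughly as sketched
below (names follow the surrounding intel_pstate patches and are assumed
here):

static int pid_param_set(void *data, u64 val)
{
	unsigned int cpu;

	*(u32 *)data = val;

	/*
	 * Walk possible CPUs, not just online ones, so a CPU cannot come
	 * back online with stale PID parameters.
	 */
	for_each_possible_cpu(cpu)
		if (all_cpu_data[cpu])
			intel_pstate_pid_reset(all_cpu_data[cpu]);

	return 0;
}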
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Notice that both the existing struct cpu_defaults instances in which
PID parameters are actually initialized use the same values of those
parameters, so it is not really necessary to copy them over to
pid_params dynamically.
Instead, initialize pid_params statically with those values and
drop the unused pid_policy member from struct cpu_defaults along
with copy_pid_params() used for initializing it.
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
The P-state selection algorithm used by intel_pstate for Atom
processors is not based on the PID controller and the initialization
of PID parameters for those processors is pointless and confusing, so
drop it.
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
After recent changes the purpose of struct perf_limits is not
particularly clear any more and the code may be made somewhat
easier to follow by eliminating it, so go for that.
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
The cpufreq core only tries to create symbolic links from CPU
directories in sysfs to policy directories in cpufreq_add_dev(),
either when a given CPU is registered or when the cpufreq driver
is registered, whichever happens first. That is not sufficient,
however, because cpufreq_add_dev() may be called for an offline CPU
whose policy object has not been created yet and, quite obviously,
the symbolic link cannot be added in that case.
Fix that by making cpufreq_online() attempt to add symbolic links to
policy objects for the CPUs in the related_cpus mask of every new
policy object created by it.
The cpufreq_driver_lock locking around the for_each_cpu() loop
in cpufreq_online() is dropped, because it is not necessary and the
code is somewhat simpler without it. Moreover, failures to create
a symbolic link will not be regarded as hard errors any more and
the CPUs without those links will not be taken offline automatically,
but that should not be problematic in practice.
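The added loop amounts to something like the sketch below inside
cpufreq_online(); sysfs_create_link() is used directly here for
illustration, the core's own helper may differ.

	unsigned int j;

	/* New policy: point every related CPU's sysfs directory at it. */
	for_each_cpu(j, policy->related_cpus) {
		struct device *dev = get_cpu_device(j);

		if (!dev)
			continue;
		/* A failed link is only logged, not treated as a hard error. */
		if (sysfs_create_link(&dev->kobj, &policy->kobj, "cpufreq"))
			dev_dbg(dev, "cpufreq symlink creation failed\n");
	}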
Reported-and-tested-by: Prashanth Prakash <pprakash@codeaurora.org>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Cc: 4.9+ <stable@vger.kernel.org> # 4.9+
Both intel_pstate_verify_policy() and intel_cpufreq_verify_policy()
set policy->cpuinfo.max_freq depending on the turbo status, but the
updates made by them are discarded by the core, because the policy
object passed to them by the core is temporary and cpuinfo.max_freq
from that object is not copied to the final policy object in
cpufreq_set_policy().
However, cpufreq_set_policy() passes the temporary policy object
to the ->setpolicy callback of the driver, so intel_pstate_set_policy()
actually sees the policy->cpuinfo.max_freq value updated by
intel_pstate_verify_policy() and not the final one. It also
updates policy->max sometimes which basically has no effect after
it returns, because the core discards that update.
To avoid confusion, eliminate policy->cpuinfo.max_freq updates from
intel_pstate_verify_policy() and intel_cpufreq_verify_policy()
entirely and check the maximum frequency explicitly in
intel_pstate_update_perf_limits() instead of relying on the
transiently updated policy->cpuinfo.max_freq value.
Moreover, move the policy->max adjustment carried out in
intel_pstate_set_policy() to a separate function and call that
function from the ->verify driver callbacks to ensure that it will
actually be effective.
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
The coordination of P-state limits used by intel_pstate in the active
mode (ie. by default) is problematic, because it synchronizes all of
the limits (ie. the global ones and the per-policy ones) so as to use
one common pair of P-state limits (min and max) across all CPUs in
the system. The drawbacks of that are as follows:
- If P-states are coordinated in hardware, it is not necessary
to coordinate them in software on top of that, so in that case
all of the above activity is in vain.
- If P-states are not coordinated in hardware, then the processor
is actually capable of setting different P-states for different
CPUs and coordinating them at the software level simply doesn't
allow that capability to be utilized.
- The coordination works in such a way that setting a per-policy
limit (eg. scaling_max_freq) for one CPU causes the common
effective limit to change (and it will affect all of the other
CPUs too), but subsequent reads from the corresponding sysfs
attributes for the other CPUs will return stale values (which
is confusing).
- Reads from the global P-state limit attributes, min_perf_pct and
max_perf_pct, return the effective common values and not the last
values set through these attributes. However, the last values
set through these attributes become hard limits that cannot be
exceeded by writes to scaling_min_freq and scaling_max_freq,
respectively, and they are not exposed, so essentially users
have to remember what they are.
All of that is painful enough to warrant a change of the management
of P-state limits in the active mode.
To that end, redesign the active mode P-state limits management in
intel_pstate in accordance with the following rules:
(1) All CPUs are affected by the global limits (that is, none of
them can be requested to run faster than the global max and
none of them can be requested to run slower than the global
min).
(2) Each individual CPU is affected by its own per-policy limits
(that is, it cannot be requested to run faster than its own
per-policy max and it cannot be requested to run slower than
its own per-policy min).
(3) The global and per-policy limits can be set independently.
Also, the global maximum and minimum P-state limits will always be
expressed as percentages of the maximum supported turbo P-state.
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Extend the set of systems for which intel_pstate will use the
"powersave" P-state selection algorithm based on CPU load in the
active mode by systems with ACPI preferred profile set to "tablet",
"appliance PC", "desktop", or "workstation" (ie. everything with a
specified preferred profile that is not a "server").
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Currently, some processors supporting HWP are only supported by
intel_pstate if HWP is actually going to be used and are not supported
otherwise, which is confusing.
Specifically, they are not supported if "intel_pstate=no_hwp" is
passed to the kernel in the command line or if the driver is started
in the passive mode ("intel_pstate=passive").
There is no real reason for that, because everything about those
processors is known anyway and the driver can work with them in all
modes, so make that happen, but use the load-based P-state selection
algorithm for the active mode "powersave" policy with them.
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
* pm-cpufreq-fixes:
cpufreq: Restore policy min/max limits on CPU online
* pm-cpufreq-sched-fixes:
cpufreq: schedutil: Fix per-CPU structure initialization in sugov_start()
* intel_pstate-fixes:
cpufreq: intel_pstate: Fix policy data management in passive mode
cpufreq: intel_pstate: One set of global limits in active mode
On CPU online the cpufreq core restores the previous governor (or
the previous "policy" setting for ->setpolicy drivers), but it does
not restore the min/max limits at the same time, which is confusing,
inconsistent and a real pain for users who set the limits and then
suspend/resume the system (using full suspend), in which case the
limits are reset on all CPUs except for the boot one.
Fix this by making cpufreq_online() restore the limits when an inactive
policy is brought online.
The commit log and patch are inspired by Rafael's earlier work.
Reported-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Signed-off-by: Viresh Kumar <viresh.kumar@linaro.org>
Cc: 4.3+ <stable@vger.kernel.org> # 4.3+
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
The policy->cpuinfo.max_freq and policy->max updates in
intel_cpufreq_turbo_update() are excessive as they are done for no
good reason and may lead to problems in principle, so they should be
dropped. However, after dropping them intel_cpufreq_turbo_update()
becomes almost entirely pointless, because the check made by it is
made again down the road in intel_pstate_prepare_request(). The
only thing in it that still needs to be done is the call to
update_turbo_state(), so drop intel_cpufreq_turbo_update() altogether
and make its callers invoke update_turbo_state() directly instead of
it.
In addition to that, fix intel_cpufreq_verify_policy() so that it
checks global.no_turbo in addition to global.turbo_disabled when
updating policy->cpuinfo.max_freq to make it consistent with
intel_pstate_verify_policy().
Fixes: 001c76f05b (cpufreq: intel_pstate: Generic governors support)
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
In the active mode intel_pstate currently uses two sets of global
limits, each associated with one of the possible scaling_governor
settings in that mode: "powersave" or "performance".
The driver switches over from one of those sets to the other
depending on the scaling_governor setting for the CPU whose
per-policy cpufreq interface in sysfs was most recently used to change
parameters exposed in there. That obviously leads to no end of
issues when the scaling_governor settings differ between CPUs.
The most recent issue was introduced by commit a240c4aa5d (cpufreq:
intel_pstate: Do not reinit performance limits in ->setpolicy)
that eliminated the reinitialization of "performance" limits in
intel_pstate_set_policy() preventing the max limit from being set
to anything below 100, among other things.
Namely, an undesirable side effect of commit a240c4aa5d is that
now, after setting scaling_governor to "performance" in the active
mode, the per-policy limits for the CPU in question go to the highest
level and stay there even when it is switched back to "powersave"
later.
As it turns out, some distributions set scaling_governor to
"performance" temporarily for all CPUs to speed-up system
initialization, so that change causes them to misbehave later.
To fix that, get rid of the performance/powersave global limits
split and use just one set of global limits for everything.
From the user's perspective, after this modification, when
scaling_governor is switched from "performance" to "powersave"
or the other way around on one CPU, the limits settings (ie. the
global max/min_perf_pct and per-policy scaling_max/min_freq for
any CPUs) will not change. Still, switching from "performance"
to "powersave" or the other way around changes the way in which
P-states are selected and in particular "performance" causes the
driver to always request the highest P-state it is allowed to ask
for for the given CPU.
Fixes: a240c4aa5d (cpufreq: intel_pstate: Do not reinit performance limits in ->setpolicy)
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
* pm-cpufreq-fixes:
cpufreq: Fix and clean up show_cpuinfo_cur_freq()
* intel_pstate-fixes:
cpufreq: intel_pstate: Avoid percentages in limits-related computations
cpufreq: intel_pstate: Correct frequency setting in the HWP mode
cpufreq: intel_pstate: Update pid_params.sample_rate_ns in pid_param_set()
The best place to register the CPU cooling device is from the cpufreq
driver, as there we know whether all the resources are already available
or not. That's what is done for the cpufreq-dt.c driver as well.
The cpu-cooling driver for the dbx500 platform was just (un)registering
with the thermal framework; that can easily be handled by the cpufreq
driver too, and in the proper sequence.
Get rid of the cooling driver and its users and manage everything
from the cpufreq driver instead.
Signed-off-by: Viresh Kumar <viresh.kumar@linaro.org>
Tested-by: Linus Walleij <linus.walleij@linaro.org>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
There is a missing newline in show_cpuinfo_cur_freq(), so add it,
but while at it clean that function up somewhat too.
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Acked-by: Viresh Kumar <viresh.kumar@linaro.org>
Cc: All applicable <stable@vger.kernel.org>
Currently, intel_pstate_update_perf_limits() first converts the
policy minimum and maximum limits into percentages of the maximum
turbo frequency (rounding up to an integer) and then converts these
percentages to fractions (by using fixed-point arithmetic to divide
them by 100).
That introduces a rounding error unnecessarily, because the fractions
can be obtained by carrying out fixed-point divisions directly on the
input numbers.
Rework the computations in intel_pstate_hwp_set() to use fractions
instead of percentages (and drop redundant local variables from
there) and modify intel_pstate_update_perf_limits() to compute the
fractions directly and percentages out of them.
While at it, introduce percent_ext_fp() for converting percentages
to fractions (with extended number of fraction bits) and use it in
the computations.
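The conversion helper boils down to a single fixed-point division; a sketch
of the idea follows, where the constants and helper names mirror
intel_pstate's extended-fraction convention but should be treated as
assumptions.

#include <linux/math64.h>

#define FRAC_BITS	8
#define EXT_BITS	6
#define EXT_FRAC_BITS	(FRAC_BITS + EXT_BITS)

/* x / y as an extended-precision fixed-point fraction. */
static inline int32_t div_ext_fp(u64 x, u64 y)
{
	return div64_u64(x << EXT_FRAC_BITS, y);
}

/* e.g. 94 (percent) -> 0.94 as a fraction, with no intermediate rounding. */
static inline int32_t percent_ext_fp(int percent)
{
	return div_ext_fp(percent, 100);
}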
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
In intel_pstate_hwp_set(), the min/max range from the HWP capability
MSR, along with max_perf_pct and min_perf_pct, is used to set the HWP
request MSR. In some cases this doesn't result in the correct HWP max/min
in the HWP request.
For example, in the following case:
HWP capabilities from MSR 0x771
0x70a1220
Here the cpufreq min/max frequencies from the above MSR dump are 700 MHz
and 3.2 GHz, respectively.
This will result in
hwp_min = 0x07
hwp_max = 0x20
To limit max frequency to 2GHz:
perf_limits->max_perf_pct = 63 (2GHz as a percent of 3.2GHz rounded up)
With the current calculation:
adj_range = max_perf_pct * range / 100;
adj_range = 63 * (32 - 7) / 100
adj_range = 15
max = hw_min + adj_range;
max = 7 + 15 = 22
This will result in HWP request of 0x160f, which will result in a
frequency cap of 2.2GHz not 2GHz.
The problem with the above calculation is that the hwp_min of 7 is treated
as 0% of the range. But max_perf_pct is calculated with the minimum taken
as 0 and the maximum as 3.2 GHz (hwp_max), so adding hwp_min to it results
in more than the desired maximum.
Since min_perf_pct and max_perf_pct are already percentages of the max
frequency (hwp_max), the min/max HWP request values can be calculated by
applying these percentages directly to hwp_max.
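Redoing the example above with the percentage applied directly to hwp_max
gives the expected cap (illustrative arithmetic only):
max = hwp_max * max_perf_pct / 100;
max = 32 * 63 / 100
max = 20 (0x14)
An HWP request max of 0x14 corresponds to 2GHz, which is the desired limit,
instead of the 22 (2.2GHz) produced by the old formula.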
Signed-off-by: Srinivas Pandruvada <srinivas.pandruvada@linux.intel.com>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Fix the debugfs interface for PID tuning to actually update
pid_params.sample_rate_ns on PID parameters updates, as changing
pid_params.sample_rate_ms via debugfs has no effect now.
Fixes: a4675fbc4a (cpufreq: intel_pstate: Replace timers with utilization update callbacks)
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Acked-by: Viresh Kumar <viresh.kumar@linaro.org>
On some platforms, the device-type property may be missing from the soc
node in the dts, which means the bus frequency cannot be obtained
correctly.
This patch enhances the bus-frequency calculation: when the device-type
property is missing from the dts, the bus frequency is obtained by looking
up the platform clock in the clock table and reading its rate.
Signed-off-by: Tang Yuantian <andy.tang@nxp.com>
Acked-by: Viresh Kumar <viresh.kumar@linaro.org>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
The Mediatek MT8173 is just one of several SOCs from the same MT817x
family, including the 6-core (4-little/2-big) MT8176.
The mt8173-cpufreq driver supports all of these SOCs, however,
machines using them may use a different machine compatible.
Since this driver checks explicitly for the machine compatible
string, add support for the whole family.
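A sketch of what the family-wide check can look like in the driver's
initcall; the compatible strings listed are illustrative, not an exhaustive
or authoritative list.

#include <linux/init.h>
#include <linux/kernel.h>
#include <linux/of.h>

static const char * const mt817x_machines[] = {
	"mediatek,mt8173",
	"mediatek,mt8176",
};

static int __init mt8173_cpufreq_driver_init(void)
{
	int i;

	for (i = 0; i < ARRAY_SIZE(mt817x_machines); i++)
		if (of_machine_is_compatible(mt817x_machines[i]))
			break;

	if (i == ARRAY_SIZE(mt817x_machines))
		return -ENODEV;

	/* ... register the cpufreq platform device as before ... */
	return 0;
}
device_initcall(mt8173_cpufreq_driver_init);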
Signed-off-by: Daniel Kurtz <djkurtz@chromium.org>
Acked-by: Viresh Kumar <viresh.kumar@linaro.org>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
This function is only called once at boot by device_initcall(), so mark
it as __init.
Signed-off-by: Daniel Kurtz <djkurtz@chromium.org>
Acked-by: Viresh Kumar <viresh.kumar@linaro.org>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
intel_pstate_hwp_set_policy() is a wrapper around
intel_pstate_hwp_set(), but the only value it adds is to check
hwp_active before calling the latter and one of its two callers
has already checked hwp_active before that happens, so in that
code path the additional check is redundant and using the wrapper
is rather pointless.
For this reason, drop intel_pstate_hwp_set_policy() and make its
callers invoke intel_pstate_hwp_set() directly (after checking
hwp_active).
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Acked-by: Viresh Kumar <viresh.kumar@linaro.org>
* pm-cpufreq:
cpufreq: intel_pstate: Do not reinit performance limits in ->setpolicy
cpufreq: intel_pstate: Fix intel_pstate_verify_policy()
cpufreq: intel_pstate: Fix global settings in active mode
cpufreq: Add the "cpufreq.off=1" cmdline option
cpufreq: intel_pstate: Avoid triggering cpu_frequency tracepoint unnecessarily
cpufreq: intel_pstate: Fix intel_cpufreq_verify_policy()
cpufreq: intel_pstate: Do not use performance_limits in passive mode
If the current P-state selection algorithm is set to "performance"
in intel_pstate_set_policy(), the limits may be initialized from
scratch, but only if no_turbo is not set and the maximum frequency
allowed for the given CPU (i.e. the policy object representing it)
is at least equal to the max frequency supported by the CPU. In all
of the other cases, the limits will not be updated.
For example, the following can happen:
# cat intel_pstate/status
active
# echo performance > cpufreq/policy0/scaling_governor
# cat intel_pstate/min_perf_pct
100
# echo 94 > intel_pstate/min_perf_pct
# cat intel_pstate/min_perf_pct
100
# cat cpufreq/policy0/scaling_max_freq
3100000
# echo 3000000 > cpufreq/policy0/scaling_max_freq
# cat intel_pstate/min_perf_pct
94
# echo 95 > intel_pstate/min_perf_pct
# cat intel_pstate/min_perf_pct
95
That is confusing for two reasons. First, the initial attempt to
change min_perf_pct to 94 seems to have no effect, even though
setting the global limits should always work. Second, after
changing scaling_max_freq for policy0 the global min_perf_pct
attribute shows 94, even though it should have not been affected
by that operation in principle.
Moreover, the final attempt to change min_perf_pct to 95 worked
as expected, because scaling_max_freq for the only policy with
scaling_governor equal to "performance" was different from the
maximum at that time.
To make all that confusion go away, modify intel_pstate_set_policy()
so that it doesn't reinitialize the limits at all.
At the same time, change intel_pstate_set_performance_limits() to
set min_sysfs_pct to 100 in the "performance" limits set so that
switching the P-state selection algorithm to "performance" causes
intel_pstate/min_perf_pct in sysfs to go to 100 (or whatever value
min_sysfs_pct in the "performance" limits is set to later).
That requires per-CPU limits to be initialized explicitly rather
than by copying the global limits to avoid setting min_sysfs_pct
in the per-CPU limits to 100.
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
The code added to intel_pstate_verify_policy() by commit 1443ebbacf
(cpufreq: intel_pstate: Fix sysfs limits enforcement for performance
policy) should use perf_limits instead of limits, because otherwise
setting global limits via sysfs may affect policies inconsistently.
For example, in the sequence of shell commands below, the
scaling_min_freq attribute for policy1 and policy2 should be
affected in the same way, because scaling_governor is set in
the same way for both of them:
# cat cpufreq/policy1/scaling_governor
powersave
# cat cpufreq/policy2/scaling_governor
powersave
# echo performance > cpufreq/policy0/scaling_governor
# echo 94 > intel_pstate/min_perf_pct
# cat cpufreq/policy0/scaling_min_freq
2914000
# cat cpufreq/policy1/scaling_min_freq
2914000
# cat cpufreq/policy2/scaling_min_freq
800000
They are affected differently, because intel_pstate_verify_policy()
is invoked with limits set to &performance_limits (left behind by
policy0) for policy1 and with limits set to &powersave_limits (left
behind by policy1) for policy2. Since perf_limits is set to the
set of limits matching the policy being updated, using it instead
of limits fixes the inconsistency.
Fixes: 1443ebbacf (cpufreq: intel_pstate: Fix sysfs limits enforcement for performance policy)
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Commit 111b8b3fe4 (cpufreq: intel_pstate: Always keep all
limits settings in sync) changed intel_pstate to invoke
cpufreq_update_policy() for every registered CPU on global sysfs
attributes updates, but that led to undesirable effects in the
active mode if the "performance" P-state selection algorithm is
configured for one CPU and the "powersave" one is chosen for
all of the other CPUs.
Namely, in that case, the following is possible:
# cd /sys/devices/system/cpu/
# cat intel_pstate/max_perf_pct
100
# cat intel_pstate/min_perf_pct
26
# echo performance > cpufreq/policy0/scaling_governor
# cat intel_pstate/max_perf_pct
100
# cat intel_pstate/min_perf_pct
100
# echo 94 > intel_pstate/min_perf_pct
# cat intel_pstate/min_perf_pct
26
The reason why this happens is because intel_pstate attempts to
maintain two sets of global limits in the active mode, one for
the "performance" P-state selection algorithm and one for the
"powersave" P-state selection algorithm, but the P-state selection
algorithms are set per policy, so the global limits cannot reflect
all of them at the same time if they are different for different
policies.
In the particular situation above, the attempt to change
min_perf_pct to 94 caused cpufreq_update_policy() to be run
for a CPU with the "powersave" P-state selection algorithm
and intel_pstate_set_policy() called by it silently switched the
global limits to the "powersave" set which finally was reflected
by the sysfs interface.
To prevent that from happening, modify intel_pstate_update_policies()
to always switch back to the set of limits that was used right before
it was invoked.
Fixes: 111b8b3fe4 (cpufreq: intel_pstate: Always keep all limits settings in sync)
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Add the "cpufreq.off=1" cmdline option.
At boot-time, this allows a user to request CONFIG_CPU_FREQ=n
behavior from a kernel built with CONFIG_CPU_FREQ=y.
This is analogous to the existing "cpuidle.off=1" option
and CONFIG_CPU_IDLE=y.
This capability is valuable when we need to debug end-user
issues in the BIOS or in Linux. It is also convenient
for enabling comparisons, which may otherwise require a new kernel,
or help from BIOS SETUP, which may be buggy or unavailable.
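One plausible wiring for such a switch, given that the cpufreq core is
built in, is a module parameter checked at core init (parameters of
built-in code are set on the command line as "cpufreq.<param>=<value>");
the sketch below shows the idea, not necessarily the exact patch.

/* in drivers/cpufreq/cpufreq.c */
static int off __read_mostly;

static int cpufreq_disabled(void)
{
	return off;
}

/* Appears on the kernel command line as "cpufreq.off=1". */
module_param(off, int, 0444);

static int __init cpufreq_core_init(void)
{
	if (cpufreq_disabled())
		return -ENODEV;	/* behave as if CONFIG_CPU_FREQ were not set */

	/* ... normal cpufreq core initialization ... */
	return 0;
}
core_initcall(cpufreq_core_init);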
Signed-off-by: Len Brown <len.brown@intel.com>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
In the passive mode the cpu_frequency trace event is already
triggered by the cpufreq core or by scaling governors, so
intel_pstate should not trigger it once again for the same
P-state updates.
In addition to that, the frequency returned by
intel_cpufreq_fast_switch() and passed via freqs.new from
intel_cpufreq_target() to cpufreq_freq_transition_end() should
reflect the P-state actually set, so make that happen.
Fixes: 001c76f05b (cpufreq: intel_pstate: Generic governors support)
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
The intel_pstate_update_perf_limits() called from
intel_cpufreq_verify_policy() may cause global P-state limits
to change which is generally confusing and unnecessary.
In the passive mode the global limits are only applied to the
frequency selected by the scaling governor (they are not taken
into account by governors when making decisions anyway), so making
them follow the per-policy limits serves no purpose and may go
against user expectations (as it generally causes the global
attributes in sysfs to change even though they have not been
written to in some cases).
Fix that by dropping the intel_pstate_update_perf_limits()
invocation from intel_cpufreq_verify_policy() (which also
reduces the code size by a few lines).
This change does not affect the per-CPU limits case, because those
limits allow any P-state to be set by default in the passive mode
and it removes the only piece of code updating them in that mode,
so the per-policy settings will be the only ones taken into account
in that case as expected.
Fixes: 001c76f05b (cpufreq: intel_pstate: Generic governors support)
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Using performance_limits in the passive mode doesn't make
sense, because in that mode the global limits are applied to the
frequency selected by the scaling governor.
The maximum and minimum P-state limits in performance_limits are both
set to 100 percent which will put all CPUs into the turbo range
regardless of what governor is used and what frequencies are
selected by it (that is particularly undesirable on CPUs with the
generic powersave governor attached).
For this reason, make intel_pstate_register_driver() always point
limits to powersave_limits in the passive mode.
Fixes: 001c76f05b (cpufreq: intel_pstate: Generic governors support)
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Pull sched.h split-up from Ingo Molnar:
"The point of these changes is to significantly reduce the
<linux/sched.h> header footprint, to speed up the kernel build and to
have a cleaner header structure.
After these changes the new <linux/sched.h>'s typical preprocessed
size goes down from a previous ~0.68 MB (~22K lines) to ~0.45 MB (~15K
lines), which is around 40% faster to build on typical configs.
Not much changed from the last version (-v2) posted three weeks ago: I
eliminated quirks, backmerged fixes plus I rebased it to an upstream
SHA1 from yesterday that includes most changes queued up in -next plus
all sched.h changes that were pending from Andrew.
I've re-tested the series both on x86 and on cross-arch defconfigs,
and did a bisectability test at a number of random points.
I tried to test as many build configurations as possible, but some
build breakage is probably still left - but it should be mostly
limited to architectures that have no cross-compiler binaries
available on kernel.org, and non-default configurations"
* 'WIP.sched-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (146 commits)
sched/headers: Clean up <linux/sched.h>
sched/headers: Remove #ifdefs from <linux/sched.h>
sched/headers: Remove the <linux/topology.h> include from <linux/sched.h>
sched/headers, hrtimer: Remove the <linux/wait.h> include from <linux/hrtimer.h>
sched/headers, x86/apic: Remove the <linux/pm.h> header inclusion from <asm/apic.h>
sched/headers, timers: Remove the <linux/sysctl.h> include from <linux/timer.h>
sched/headers: Remove <linux/magic.h> from <linux/sched/task_stack.h>
sched/headers: Remove <linux/sched.h> from <linux/sched/init.h>
sched/core: Remove unused prefetch_stack()
sched/headers: Remove <linux/rculist.h> from <linux/sched.h>
sched/headers: Remove the 'init_pid_ns' prototype from <linux/sched.h>
sched/headers: Remove <linux/signal.h> from <linux/sched.h>
sched/headers: Remove <linux/rwsem.h> from <linux/sched.h>
sched/headers: Remove the runqueue_is_locked() prototype
sched/headers: Remove <linux/sched.h> from <linux/sched/hotplug.h>
sched/headers: Remove <linux/sched.h> from <linux/sched/debug.h>
sched/headers: Remove <linux/sched.h> from <linux/sched/nohz.h>
sched/headers: Remove <linux/sched.h> from <linux/sched/stat.h>
sched/headers: Remove the <linux/gfp.h> include from <linux/sched.h>
sched/headers: Remove <linux/rtmutex.h> from <linux/sched.h>
...
These update turbostat significantly and in particular:
- Default output is now verbose, --debug is no longer required to
get all counters. As a result, some options have been added to
specify exactly what output is wanted.
- Added --quiet to skip system configuration output
- Added --list, --show and --hide parameters
- Added --cpu parameter
- Enhanced Baytrail SoC support
- Added Gemini Lake SoC support
- Added sysfs C-state columns
Also the symbol definitions in arch/x86/include/asm/intel-family.h
and arch/x86/include/asm/msr-index.h are updated and the intel_idle
and intel_pstate drivers are modified to use the updated symbols.
Credits to Len Brown for all of these changes.
Merge tag 'pm-turbostat-4.11-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm
Pull turbostat utility updates from Rafael Wysocki:
"Power management turbostat utility updates.
These update turbostat significantly and in particular:
- default output is now verbose, --debug is no longer required to get
all counters. As a result, some options have been added to specify
exactly what output is wanted.
- added --quiet to skip system configuration output
- added --list, --show and --hide parameters
- added --cpu parameter
- enhanced Baytrail SoC support
- added Gemini Lake SoC support
- added sysfs C-state columns
Also the symbol definitions in arch/x86/include/asm/intel-family.h and
arch/x86/include/asm/msr-index.h are updated and the intel_idle and
intel_pstate drivers are modified to use the updated symbols.
Credits to Len Brown for all of these changes"
* tag 'pm-turbostat-4.11-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm: (44 commits)
tools/power turbostat: version 17.02.24
tools/power turbostat: bugfix: --add u32 was printed as u64
tools/power turbostat: show error on exec
tools/power turbostat: dump p-state software config
tools/power turbostat: show package number, even without --debug
tools/power turbostat: support "--hide C1" etc.
tools/power turbostat: move --Package and --processor into the --cpu option
tools/power turbostat: turbostat.8 update
tools/power turbostat: update --list feature
tools/power turbostat: use wide columns to display large numbers
tools/power turbostat: Add --list option to show available header names
tools/power turbostat: fix zero IRQ count shown in one-shot command mode
tools/power turbostat: add --cpu parameter
tools/power turbostat: print sysfs C-state stats
tools/power turbostat: extend --add option to accept /sys path
tools/power turbostat: skip unused counters on BDX
tools/power turbostat: fix decoding for GLM, DNV, SKX turbo-ratio limits
tools/power turbostat: skip unused counters on SKX
tools/power turbostat: Denverton: use HW CC1 counter, skip C3, C7
tools/power turbostat: initial Gemini Lake SOC support
...
- Fix for a cpuidle menu governor problem that started to take an
unnecessary spinlock after one of the recent updates and that
did not play well with the RT patch (Rafael Wysocki).
- Fix for the new intel_pstate operation mode switching feature
added recently that did not reinitialize P-state limits properly
when switching operation modes (Rafael Wysocki).
- Removal of unused global notifiers from the PM QoS framework
(Viresh Kumar).
- Generic power domains framework update to make it handle
asynchronous invocations of PM callbacks in the "noirq" phases
of system suspend/hibernation correctly (Ulf Hansson).
- Two hibernation core cleanups (Rafael Wysocki).
- intel_idle cleanup related to the sysfs interface (Len Brown).
- Off-by-one bug fix in the OPP (Operating Performance Points)
framework (Andrzej Hajda).
- OPP framework's documentation fix (Viresh Kumar).
- cpufreq qoriq driver cleanup (Tang Yuantian).
- Fixes for typos in comments in the device runtime PM framework
(Christophe Jaillet).
Merge tag 'pm-extra-4.11-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm
Pull more power management updates from Rafael Wysocki:
"These fix two bugs introduced by recent power management updates (in
the cpuidle menu governor and intel_pstate) and a few other issues,
clean up things and remove unused code.
Specifics:
- Fix for a cpuidle menu governor problem that started to take an
unnecessary spinlock after one of the recent updates and that did
not play well with the RT patch (Rafael Wysocki).
- Fix for the new intel_pstate operation mode switching feature added
recently that did not reinitialize P-state limits properly when
switching operation modes (Rafael Wysocki).
- Removal of unused global notifiers from the PM QoS framework
(Viresh Kumar).
- Generic power domains framework update to make it handle
asynchronous invocations of PM callbacks in the "noirq" phases of
system suspend/hibernation correctly (Ulf Hansson).
- Two hibernation core cleanups (Rafael Wysocki).
- intel_idle cleanup related to the sysfs interface (Len Brown).
- Off-by-one bug fix in the OPP (Operating Performance Points)
framework (Andrzej Hajda).
- OPP framework's documentation fix (Viresh Kumar).
- cpufreq qoriq driver cleanup (Tang Yuantian).
- Fixes for typos in comments in the device runtime PM framework
(Christophe Jaillet)"
* tag 'pm-extra-4.11-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm:
PM / OPP: Documentation: Fix opp-microvolt in examples
intel_idle: stop exposing platform acronyms in sysfs
cpufreq: intel_pstate: Fix limits issue with operation mode switching
PM / hibernate: Define pr_fmt() and use pr_*() instead of printk()
PM / hibernate: Untangle power_down()
cpuidle: menu: Avoid taking spinlock for accessing QoS values
PM / QoS: Remove global notifiers
PM / runtime: Fix some typos
cpufreq: qoriq: clean up unused code
PM / OPP: fix off-by-one bug in dev_pm_opp_get_max_volt_latency loop
PM / Domains: Power off masters immediately in the power off sequence
PM / Domains: Rename is_async to one_dev_on for genpd_power_off()
PM / Domains: Move genpd_power_off() above genpd_power_on()
We are going to split <linux/sched/cpufreq.h> out of <linux/sched.h>, which
will have to be picked up from other headers and a couple of .c files.
Create a trivial placeholder <linux/sched/cpufreq.h> file that just
maps to <linux/sched.h> to make this patch obviously correct and
bisectable.
Include the new header in the files that are going to need it.
Acked-by: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: linux-kernel@vger.kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
So the original intention of tsk_cpus_allowed() was to 'future-proof'
the field - but it's pretty ineffectual at that, because half of
the code uses ->cpus_allowed directly ...
Also, the wrapper makes the code longer than the original expression!
So just get rid of it. This also shrinks <linux/sched.h> a bit.
Acked-by: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: linux-kernel@vger.kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Pull changes related to turbostat for v4.11 from Len Brown.
* 'turbostat' of git://git.kernel.org/pub/scm/linux/kernel/git/lenb/linux: (44 commits)
tools/power turbostat: version 17.02.24
tools/power turbostat: bugfix: --add u32 was printed as u64
tools/power turbostat: show error on exec
tools/power turbostat: dump p-state software config
tools/power turbostat: show package number, even without --debug
tools/power turbostat: support "--hide C1" etc.
tools/power turbostat: move --Package and --processor into the --cpu option
tools/power turbostat: turbostat.8 update
tools/power turbostat: update --list feature
tools/power turbostat: use wide columns to display large numbers
tools/power turbostat: Add --list option to show available header names
tools/power turbostat: fix zero IRQ count shown in one-shot command mode
tools/power turbostat: add --cpu parameter
tools/power turbostat: print sysfs C-state stats
tools/power turbostat: extend --add option to accept /sys path
tools/power turbostat: skip unused counters on BDX
tools/power turbostat: fix decoding for GLM, DNV, SKX turbo-ratio limits
tools/power turbostat: skip unused counters on SKX
tools/power turbostat: Denverton: use HW CC1 counter, skip C3, C7
tools/power turbostat: initial Gemini Lake SOC support
...
Originally, these MSRs were locally defined in this driver.
Now the definitions are in msr-index.h -- use them.
Signed-off-by: Len Brown <len.brown@intel.com>
There is a problem with intel_pstate operation mode switching
introduced by commit fb1fe1041c (cpufreq: intel_pstate: Operation
mode control from sysfs), because the global sysfs limits are
preserved across operation modes while per-policy limits are
reinitialized from scratch on a mode switch and both sets of limits
may get out of sync this way.
Fix that by always reinitializing the global limits upon the
registration of the driver.
Fixes: fb1fe1041c (cpufreq: intel_pstate: Operation mode control from sysfs)
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Acked-by: Srinivas Pandruvada <srinivas.pandruvada@linux.intel.com>
We sometimes collect non-critical fixes that come in during the later part
of the merge window in a branch for the next release instead, and this is
that contents for v4.11.
Most of these are OMAP fixes, dealing with OMAP36/37 detection, quirks
and setup. There's also some fixes for Davinci and a Kconfig fix for SCPI
to only enable on ARM{,64}.
Merge tag 'armsoc-fixes-nc' of git://git.kernel.org/pub/scm/linux/kernel/git/arm/arm-soc
Pull ARM SoC non-urgent fixes from Arnd Bergmann:
"We sometimes collect non-critical fixes that come in during the later
part of the merge window in a branch for the next release instead, and
this is that contents for v4.11.
Most of these are OMAP fixes, dealing with OMAP36/37 detection, quirks
and setup. There's also some fixes for Davinci and a Kconfig fix for
SCPI to only enable on ARM{,64}"
* tag 'armsoc-fixes-nc' of git://git.kernel.org/pub/scm/linux/kernel/git/arm/arm-soc:
firmware: arm_scpi: Add hardware dependencies
ARM: OMAP3: Fix SoC detection of OMAP36/37 Family
ARM: OMAP5: Add HWMOD_SWSUP_SIDLE_ACT flag for UART
ARM: dts: Fix compatible for ti81xx uarts for 8250
ARM: dts: Fix am335x and dm814x scm syscon to probe children
ARM: OMAP2+: Fix init for multiple quirks for the same SoC
ARM: dts: Fix omap3 off mode pull defines
bus: da850-mstpri: fix my e-mail address
ARM: davinci: da850: fix da850_set_pll0rate()
ARM: davinci: da850: coding style fix
This snippet of code is not needed anymore since its user,
get_hard_smp_processor_id(), has been removed.
Signed-off-by: Tang Yuantian <yuantian.tang@nxp.com>
Acked-by: Viresh Kumar <viresh.kumar@linaro.org>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Merge tag 'pm-4.11-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm
Pull power management updates from Rafael Wysocki:
"The majority of changes go into the Operating Performance Points (OPP)
framework and cpufreq this time, followed by devfreq and some
scattered updates all over.
The OPP changes are mostly related to switching over from RCU-based
synchronization, which turned out to be overly complicated and
problematic, to reference counting using krefs.
In the cpufreq land there are core cleanups, documentation updates, a
new driver for Broadcom BMIPS SoCs, a new cpufreq-dt sub-driver for TI
SoCs that require special handling, ARM64 SoCs support for the qoriq
driver, intel_pstate updates, powernv driver update and assorted
fixes.
The devfreq changes are mostly fixes related to the sysfs interface
and some Exynos driver updates.
Apart from that, the cpuidle menu governor will support per-CPU PM QoS
constraints for the wakeup latency now, some bugs in the wakeup IRQs
framework are fixed, the generic power domains framework should handle
asynchronous invocations of *noirq suspend/resume callbacks from now
on, the analyze_suspend.py script is updated and there is a new tool
for intel_pstate diagnostics.
Specifics:
- Operating Performance Points (OPP) framework fixes, cleanups and
switch over from RCU-based synchronization to reference counting
using krefs (Viresh Kumar, Wei Yongjun, Dave Gerlach)
- cpufreq core cleanups and documentation updates (Viresh Kumar,
Rafael Wysocki)
- New cpufreq driver for Broadcom BMIPS SoCs (Markus Mayer)
- New cpufreq-dt sub-driver for TI SoCs requiring special handling,
like in the AM335x, AM437x, DRA7x, and AM57x families, along with
new DT bindings for it (Dave Gerlach, Paul Gortmaker)
- ARM64 SoCs support for the qoriq cpufreq driver (Tang Yuantian)
- intel_pstate driver updates including a new sysfs knob to control
the driver's operation mode and fixes related to the no_turbo sysfs
knob and the hardware-managed P-states feature support (Rafael
Wysocki, Srinivas Pandruvada)
- New interface to export ultra-turbo frequencies for the powernv
cpufreq driver (Shilpasri Bhat)
- Assorted fixes for cpufreq drivers (Arnd Bergmann, Dan Carpenter,
Wei Yongjun)
- devfreq core fixes, mostly related to the sysfs interface exported
by it (Chanwoo Choi, Chris Diamand)
- Updates of the exynos-bus and exynos-ppmu devfreq drivers (Chanwoo
Choi)
- Device PM QoS extension to support CPUs and support for per-CPU
wakeup (device resume) latency constraints in the cpuidle menu
governor (Alex Shi)
- Wakeup IRQs framework fixes (Grygorii Strashko)
- Generic power domains framework update including a fix to make it
handle asynchronous invocations of *noirq suspend/resume callbacks
correctly (Ulf Hansson, Geert Uytterhoeven)
- Assorted fixes and cleanups in the core suspend/hibernate code, PM
QoS framework and x86 ACPI idle support code (Corentin Labbe, Geert
Uytterhoeven, Geliang Tang, John Keeping, Nick Desaulniers)
- Update of the analyze_suspend.py script to version 4.5, offering
multiple improvements (Todd Brandt)
- New tool for intel_pstate diagnostics using the pstate_sample
tracepoint (Doug Smythies)"
* tag 'pm-4.11-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm: (85 commits)
MAINTAINERS: cpufreq: add bmips-cpufreq.c
PM / QoS: Fix memory leak on resume_latency.notifiers
PM / Documentation: Spelling s/wrtie/write/
PM / sleep: Fix test_suspend after sleep state rework
cpufreq: CPPC: add ACPI_PROCESSOR dependency
cpufreq: make ti-cpufreq explicitly non-modular
cpufreq: Do not clear real_cpus mask on policy init
tools/power/x86: Debug utility for intel_pstate driver
AnalyzeSuspend: fix drag and zoom bug in javascript
PM / wakeirq: report a wakeup_event on dedicated wekup irq
PM / wakeirq: Fix spurious wake-up events for dedicated wakeirqs
PM / wakeirq: Enable dedicated wakeirq for suspend
cpufreq: dt: Don't use generic platdev driver for ti-cpufreq platforms
cpufreq: ti: Add cpufreq driver to determine available OPPs at runtime
Documentation: dt: add bindings for ti-cpufreq
PM / OPP: Expose _of_get_opp_desc_node as dev_pm_opp API
cpufreq: qoriq: Don't look at clock implementation details
cpufreq: qoriq: add ARM64 SoCs support
PM / Domains: Provide dummy governors if CONFIG_PM_GENERIC_DOMAINS=n
cpufreq: brcmstb-avs-cpufreq: remove unnecessary platform_set_drvdata()
...
Pull scheduler updates from Ingo Molnar:
"The main changes in this (fairly busy) cycle were:
- There was a class of scheduler bugs related to forgetting to update
the rq-clock timestamp, which can cause weird and hard-to-debug
problems, so there's a new debug facility for this. The facility
uncovered a whole lot of bugs, which convinced us to keep it.
(Peter Zijlstra, Matt Fleming)
- Various cputime related updates: eliminate cputime and use u64
nanoseconds directly, simplify and improve the arch interfaces,
implement delayed accounting more widely, etc. - (Frederic
Weisbecker)
- Move code around for better structure plus cleanups (Ingo Molnar)
- Move IO schedule accounting deeper into the scheduler plus related
changes to improve the situation (Tejun Heo)
- ... plus a round of sched/rt and sched/deadline fixes, plus other
fixes, updates and cleanups"
* 'sched-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (85 commits)
sched/core: Remove unlikely() annotation from sched_move_task()
sched/autogroup: Rename auto_group.[ch] to autogroup.[ch]
sched/topology: Split out scheduler topology code from core.c into topology.c
sched/core: Remove unnecessary #include headers
sched/rq_clock: Consolidate the ordering of the rq_clock methods
delayacct: Include <uapi/linux/taskstats.h>
sched/core: Clean up comments
sched/rt: Show the 'sched_rr_timeslice' SCHED_RR timeslice tuning knob in milliseconds
sched/clock: Add dummy clear_sched_clock_stable() stub function
sched/cputime: Remove generic asm headers
sched/cputime: Remove unused nsec_to_cputime()
s390, sched/cputime: Remove unused cputime definitions
powerpc, sched/cputime: Remove unused cputime definitions
s390, sched/cputime: Make arch_cpu_idle_time() to return nsecs
ia64, sched/cputime: Remove unused cputime definitions
ia64: Convert vtime to use nsec units directly
ia64, sched/cputime: Move the nsecs based cputime headers to the last arch using it
sched/cputime: Remove jiffies based cputime
sched/cputime, vtime: Return nsecs instead of cputime_t to account
sched/cputime: Complete nsec conversion of tick based accounting
...
Without the Kconfig dependency, we can get this warning:
warning: ACPI_CPPC_CPUFREQ selects ACPI_CPPC_LIB which has unmet direct dependencies (ACPI && ACPI_PROCESSOR)
Fixes: 5477fb3bd1 (ACPI / CPPC: Add a CPUFreq driver for use with CPPC)
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Acked-by: Viresh Kumar <viresh.kumar@linaro.org>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
The Kconfig currently controlling compilation of this code is:
drivers/cpufreq/Kconfig.arm:config ARM_TI_CPUFREQ
drivers/cpufreq/Kconfig.arm: bool "Texas Instruments CPUFreq support"
...meaning that it currently is not being built as a module by anyone.
Let's remove the couple of traces of modular infrastructure use, so that
when reading the driver there is no doubt it is builtin-only.
Since module_init translates to device_initcall in the non-modular
case, the init ordering remains unchanged with this commit.
We also delete the MODULE_LICENSE tag etc. since all that information
is already contained at the top of the file in the comments.
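As a hedged sketch of the kind of change this describes (the init function
name is assumed for illustration):

#include <linux/init.h>

static int __init ti_cpufreq_init(void)
{
        return 0;       /* real probing work elided in this sketch */
}

/*
 * Previously: module_init(ti_cpufreq_init); plus MODULE_LICENSE() etc.
 * Builtin-only now -- module_init() maps to device_initcall() in the
 * non-modular case anyway, so the init ordering does not change:
 */
device_initcall(ti_cpufreq_init);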
Signed-off-by: Paul Gortmaker <paul.gortmaker@windriver.com>
Acked-by: Viresh Kumar <viresh.kumar@linaro.org>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
If new_policy is set in cpufreq_online(), the policy object has just
been created and its real_cpus mask has been zeroed on allocation,
and the driver's ->init() callback should not touch it.
It doesn't need to be cleared again, so don't do that.
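A heavily simplified sketch of the invariant being relied on (not the actual
cpufreq_online() code; the allocation helper here is illustrative):

#include <linux/cpufreq.h>
#include <linux/slab.h>

static struct cpufreq_policy *alloc_policy_sketch(void)
{
        struct cpufreq_policy *policy = kzalloc(sizeof(*policy), GFP_KERNEL);

        if (policy && !zalloc_cpumask_var(&policy->real_cpus, GFP_KERNEL)) {
                kfree(policy);
                return NULL;
        }
        /*
         * real_cpus is guaranteed to be empty at this point, so the
         * new_policy path does not need an extra cpumask_clear().
         */
        return policy;
}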
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Acked-by: Viresh Kumar <viresh.kumar@linaro.org>
Some TI platforms, specifically those in the am33xx, am43xx, dra7xx, and
am57xx families of SoCs, can make use of the ti-cpufreq driver to
selectively enable OPPs based on the exact configuration in use. The
ti-cpufreq driver is given the responsibility of creating the cpufreq-dt
platform device when it is in use, so drop am33xx and dra7xx from the
cpufreq-dt-platdev driver so that the device is not created twice.
Acked-by: Viresh Kumar <viresh.kumar@linaro.org>
Signed-off-by: Dave Gerlach <d-gerlach@ti.com>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Some TI SoCs, like those in the AM335x, AM437x, DRA7x, and AM57x families,
have different OPPs available for the MPU depending on which specific
variant of the SoC is in use. This can be determined through use of the
revision and an eFuse register present in the silicon. Introduce a
ti-cpufreq driver that can read the aforementioned values and provide
them as version matching data to the OPP framework. Through this, the
opp-supported-hw DT binding that is part of the operating-points-v2
table can be used to indicate the availability of OPPs for each device.
This driver also creates the "cpufreq-dt" platform_device after passing
the version matching data to the OPP framework so that cpufreq-dt
handles the actual cpufreq implementation. Even without the necessary
version matching data, the driver will still create this device to
maintain backwards compatibility with operating-points v1 tables.
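A rough sketch of that flow, with illustrative version values and the error
handling omitted (not the driver's actual code):

#include <linux/kernel.h>
#include <linux/pm_opp.h>
#include <linux/platform_device.h>

static void ti_cpufreq_setup_sketch(struct device *cpu_dev)
{
        /* These would really come from the revision and eFuse registers;
         * the constants here are purely illustrative. */
        u32 version[] = { 0x1 /* SoC revision */, 0x2 /* eFuse bits */ };

        /* Tell the OPP core which opp-supported-hw entries apply. */
        dev_pm_opp_set_supported_hw(cpu_dev, version, ARRAY_SIZE(version));

        /* Then let the generic cpufreq-dt driver do the actual work. */
        platform_device_register_simple("cpufreq-dt", -1, NULL, 0);
}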
Acked-by: Viresh Kumar <viresh.kumar@linaro.org>
Signed-off-by: Dave Gerlach <d-gerlach@ti.com>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Get the CPU clock's potential parent clocks from the clock interface
itself, rather than manually parsing the clocks property to find a
phandle, looking at the clock-names property of that, and assuming that
those are valid parent clocks for the cpu clock.
This is necessary now that the clocks are generated based on the clock
driver's knowledge of the chip rather than a fragile device-tree
description of the mux options.
We can now rely on the clock driver to ensure that the mux only exposes
options that are valid. The cpufreq driver was previously being overly
conservative in some cases -- for example, the "min_cpufreq =
get_bus_freq()" restriction only applies to chips with erratum
A-004510, and whether the freq_mask used on p5020 is needed depends on
the actual frequencies of the PLLs (FWIW, p5040 has a similar
limitation but its .freq_mask was zero) -- and the frequency mask
mechanism made assumptions about particular parent clock indices that
are no longer valid.
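As a hedged sketch of the general approach (not the driver's actual code),
the clock framework can be queried for the candidate parents directly:

#include <linux/clk-provider.h>
#include <linux/kernel.h>

static void print_parent_rates_sketch(struct clk_hw *cpu_hw)
{
        unsigned int i, num_parents = clk_hw_get_num_parents(cpu_hw);

        for (i = 0; i < num_parents; i++) {
                struct clk_hw *parent = clk_hw_get_parent_by_index(cpu_hw, i);

                if (parent)
                        pr_info("parent %u runs at %lu Hz\n",
                                i, clk_hw_get_rate(parent));
        }
}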
Signed-off-by: Scott Wood <scottwood@nxp.com>
Signed-off-by: Tang Yuantian <yuantian.tang@nxp.com>
Acked-by: Viresh Kumar <viresh.kumar@linaro.org>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Add ARM64 config to Kconfig to enable CPU frequency feature on
NXP ARM64 SoCs.
Signed-off-by: Tang Yuantian <yuantian.tang@nxp.com>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
The driver core clears the driver data to NULL after device_release
or on probe failure, so there is no need to manually clear the device
driver data to NULL.
Signed-off-by: Wei Yongjun <weiyongjun1@huawei.com>
Acked-by: Viresh Kumar <viresh.kumar@linaro.org>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
The "goto err_armclk;" error path already does a clk_put(s3c_freq->hclk);
so this is a double free.
Fixes: 34ee550752 ([CPUFREQ] Add S3C2416/S3C2450 cpufreq driver)
Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com>
Reviewed-by: Krzysztof Kozlowski <krzk@kernel.org>
Acked-by: Viresh Kumar <viresh.kumar@linaro.org>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Add the MIPS CPUfreq driver. This driver currently supports CPUfreq on
BMIPS5xxx-based SoCs.
Signed-off-by: Markus Mayer <mmayer@broadcom.com>
Acked-by: Florian Fainelli <f.fainelli@gmail.com>
Acked-by: Viresh Kumar <viresh.kumar@linaro.org>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Some Kabylake desktop processors may not reach max turbo when running in
HWP mode, even if running under sustained 100% utilization.
This occurs when the HWP.EPP (Energy Performance Preference) is set to
"balance_power" (0x80) -- the default on most systems.
It occurs because the platform BIOS may erroneously enable an
energy-efficiency setting -- MSR_IA32_POWER_CTL BIT-EE, which is not
recommended to be enabled on this SKU.
On the failing systems, this BIOS issue was not discovered when the
desktop motherboard was tested with Windows, because the BIOS also
neglects to provide the ACPI/CPPC table, that Windows requires to enable
HWP, and so Windows runs in legacy P-state mode, where this setting has
no effect.
Linux' intel_pstate driver does not require ACPI/CPPC to enable HWP, and
so it runs in HWP mode, exposing this incorrect BIOS configuration.
There are several ways to address this problem.
First, Linux can also run in legacy P-state mode on this system.
As intel_pstate is how Linux enables HWP, booting with
"intel_pstate=disable"
makes the system run in acpi-cpufreq/ondemand legacy P-state mode.
Or second, the "performance" governor can be used with intel_pstate,
which will modify HWP.EPP to 0.
Or third, starting in 4.10, the
/sys/devices/system/cpu/cpufreq/policy*/energy_performance_preference
attribute can be updated from "balance_power" to "performance".
Or fourth, apply this patch, which fixes the erroneous setting of
MSR_IA32_POWER_CTL BIT_EE on this model, allowing the default
configuration to function as designed.
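A hedged sketch of the kind of quirk described, assuming the energy-efficiency
control sits at bit 19 of MSR_IA32_POWER_CTL (the bit position should be
verified against Intel documentation):

#include <asm/msr.h>

#define POWER_CTL_EE_SKETCH     (1ULL << 19)    /* assumed bit position */

static void disable_power_ctl_ee_sketch(void)
{
        u64 power_ctl;

        rdmsrl(MSR_IA32_POWER_CTL, power_ctl);
        power_ctl &= ~POWER_CTL_EE_SKETCH;
        wrmsrl(MSR_IA32_POWER_CTL, power_ctl);
}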
Signed-off-by: Srinivas Pandruvada <srinivas.pandruvada@linux.intel.com>
Reviewed-by: Len Brown <len.brown@intel.com>
Cc: 4.6+ <stable@vger.kernel.org> # 4.6+
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
When HWP is active, the turbo activation ratio is not used to calculate
the max non-turbo ratio; on these systems the max non-turbo ratio is
decided by the config TDP settings.
This change removes the use of MSR_TURBO_ACTIVATION_RATIO for HWP
systems and instead uses the TDP ratios directly when more than one TDP
is available.
Signed-off-by: Srinivas Pandruvada <srinivas.pandruvada@linux.intel.com>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Under HWP the performance limits are calculated from max_perf_pct and
min_perf_pct with respect to possible performance, not available
performance. The available performance can be reduced by the no_turbo
setting. To stay compatible with legacy mode, apply the max/min
performance percentages with respect to available performance. For
example, with a turbo ratio of 36 and a max non-turbo ratio of 24,
max_perf_pct=50 should cap performance at ratio 12 (50% of 24) when
turbo is disabled, rather than 18 (50% of 36).
Signed-off-by: Srinivas Pandruvada <srinivas.pandruvada@linux.intel.com>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
When turbo is not disabled by the BIOS, but the user disables it from the
intel_pstate sysfs and then changes max/min using the cpufreq sysfs, the
resulting frequency is lower than what the user requested.
The reason is that when the perf limits are calculated in the set_policy()
callback, they are with reference to the max CPU frequency (the turbo
frequency), but when they are enforced in intel_pstate_get_min_max() they
are with reference to the max available performance, as documented in the
intel_pstate documentation (in this case the max non-turbo P-state).
This needs a change similar to the one made in intel_cpufreq_verify_policy()
for passive mode: set policy->cpuinfo.max_freq based on the turbo status.
Signed-off-by: Srinivas Pandruvada <srinivas.pandruvada@linux.intel.com>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Make it possible to change the operation mode of intel_pstate with
the help of a new sysfs attribute called "status".
There are three possible configurations that can be selected using
this attribute:
"off" - The driver is not in use at this time.
"active" - The driver works as a P-state governor (default).
"passive" - The driver works as a regular cpufreq one and collaborates
with the generic cpufreq governors (it sets P-states as
requested by those governors). [This is the same mode
the driver can be started in by passing intel_pstate=passive
in the kernel command line.]
The current setting is returned by reads from this attribute. Writing
one of the above strings to it changes the operation mode as indicated
by that string, if possible.
If the HW-managed P-states (HWP) feature is enabled, it is not possible
to change the driver's operation mode, and attempts to write to this
attribute will fail.
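A hedged sketch of the three-way mapping the attribute exposes (illustrative
only, not the driver's code):

#include <linux/types.h>

static const char *pstate_status_string_sketch(bool registered, bool passive)
{
        if (!registered)
                return "off";           /* driver not in use */
        return passive ? "passive"      /* cooperates with generic governors */
                       : "active";      /* built-in P-state governor (default) */
}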
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Expose intel_pstate's global sysfs attributes before registering
the driver to prepare for the addition of an attribute that also will
have to work if the driver is not registered.
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
In P8+, Workload Optimized Frequency (WOF) provides the capability to
boost the CPU frequency based on the utilization of the other CPUs
running in the chip. The On-Chip Controller (OCC) firmware will control
the achievability of these frequencies depending on the power headroom
available in the chip. Currently the ultra-turbo frequencies provided
by this feature are exported along with the turbo and sub-turbo
frequencies as scaling_available_frequencies. This patch will export
the ultra-turbo frequencies separately as scaling_boost_frequencies on
WOF-enabled systems. It also adds the boost sysfs file, which can be
used to enable/disable ultra-turbo frequencies.
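As a hedged illustration of the generic cpufreq mechanism usually behind
scaling_boost_frequencies (the kHz values are made up), boost entries are
flagged in the driver's frequency table and gated by the boost knob:

#include <linux/cpufreq.h>

static struct cpufreq_frequency_table powernv_freqs_sketch[] = {
        { .frequency = 4322000, .flags = CPUFREQ_BOOST_FREQ },  /* ultra-turbo */
        { .frequency = 3690000 },                                /* turbo */
        { .frequency = 2061000 },                                /* sub-turbo */
        { .frequency = CPUFREQ_TABLE_END },
};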
Signed-off-by: Shilpasri G Bhat <shilpa.bhat@linux.vnet.ibm.com>
Reviewed-by: Gautham R. Shenoy <ego@linux.vnet.ibm.com>
Acked-by: Viresh Kumar <viresh.kumar@linaro.org>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
The CPU_FREQ_STAT_DETAILS Kconfig option doesn't have any benefit apart
from saving a small amount of memory when it is disabled, and the ifdef
hackery it requires makes the code unnecessarily dirty.
Clean it up by removing the Kconfig option completely. A few defconfigs
are also updated, with CONFIG_CPU_FREQ_STAT_DETAILS replaced by
CONFIG_CPU_FREQ_STAT in them, as users wanted stats to be enabled.
Signed-off-by: Viresh Kumar <viresh.kumar@linaro.org>
Acked-by: Gregory CLEMENT <gregory.clement@free-electrons.com>
Reviewed-by: Chanwoo Choi <cw00.choi@samsung.com>
Acked-by: Krzysztof Kozlowski <krzk@kernel.org>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Those were added by:
commit fcd7af917a ("cpufreq: stats: handle cpufreq_unregister_driver()
and suspend/resume properly")
but aren't used anymore since:
commit 1aefc75b24 ("cpufreq: stats: Make the stats code non-modular").
Remove them. Also remove the redundant parameter to the respective
routines.
Signed-off-by: Viresh Kumar <viresh.kumar@linaro.org>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Kernel CPU stats are stored in cputime_t, which is an architecture-defined
type and hence a bit opaque, requiring accessors and mutators for any
operation.
Converting them to nsecs simplifies the code and is one step toward
the removal of cputime_t in the core code.
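For instance, a consumer of the idle statistics can read nanoseconds directly
after the conversion; a simplified, hedged sketch:

#include <linux/kernel_stat.h>

static u64 idle_time_ns_sketch(int cpu)
{
        /* After the conversion, cpustat[] holds u64 nanoseconds directly,
         * so no cputime_to_*() accessor is needed any more. */
        return kcpustat_cpu(cpu).cpustat[CPUTIME_IDLE];
}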
Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
Cc: Tony Luck <tony.luck@intel.com>
Cc: Fenghua Yu <fenghua.yu@intel.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Rik van Riel <riel@redhat.com>
Cc: Stanislaw Gruszka <sgruszka@redhat.com>
Cc: Wanpeng Li <wanpeng.li@hotmail.com>
Link: http://lkml.kernel.org/r/1485832191-26889-4-git-send-email-fweisbec@gmail.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
This patch updates dev_pm_opp_find_freq_*() routines to get a reference
to the OPPs returned by them.
Also updates the users of dev_pm_opp_find_freq_*() routines to call
dev_pm_opp_put() after they are done using the OPPs.
As it is guaranteed that the OPPs won't get freed while being used,
the RCU read-side locking present in the users isn't required anymore.
Drop it as well.
This patch also updates all users of devfreq_recommended_opp() which was
returning an OPP received from the OPP core.
Note that some of the OPP core routines have gained
rcu_read_{lock|unlock}() calls, as they still use RCU-specific APIs
within them.
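A hedged sketch of the resulting usage pattern (simplified; the wrapper
function is illustrative):

#include <linux/err.h>
#include <linux/pm_opp.h>

static unsigned long opp_voltage_for_rate_sketch(struct device *dev,
                                                 unsigned long *rate)
{
        struct dev_pm_opp *opp;
        unsigned long volt;

        /* No rcu_read_lock()/unlock() around the lookup any more. */
        opp = dev_pm_opp_find_freq_ceil(dev, rate);
        if (IS_ERR(opp))
                return 0;

        volt = dev_pm_opp_get_voltage(opp);
        dev_pm_opp_put(opp);    /* drop the reference the lookup took */

        return volt;
}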
Signed-off-by: Viresh Kumar <viresh.kumar@linaro.org>
Reviewed-by: Chanwoo Choi <cw00.choi@samsung.com> [Devfreq]
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Now that we have proper kernel reference infrastructure in place for OPP
tables, use it to guarantee that the OPP table isn't freed while being
used by the callers of dev_pm_opp_set_*() APIs.
Make them all return the pointer to the OPP table after taking its
reference and put the reference back with dev_pm_opp_put_*() APIs.
Now that the OPP table can't be freed while these routines are executing
(after dev_pm_opp_get_opp_table() is called), there is no need to take
opp_table_lock. Drop that locking as well.
Remove the rcu specific comments from these routines as they aren't
relevant anymore.
Note that prototypes of dev_pm_opp_{set|put}_regulators() were already
updated by another patch.
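A hedged sketch of the new pairing for one of these helpers,
dev_pm_opp_set_prop_name(), with an illustrative property name:

#include <linux/err.h>
#include <linux/pm_opp.h>

static int set_prop_name_sketch(struct device *dev)
{
        struct opp_table *opp_table;

        opp_table = dev_pm_opp_set_prop_name(dev, "slow");
        if (IS_ERR(opp_table))
                return PTR_ERR(opp_table);

        /* ... use the OPPs while holding the table reference ... */

        dev_pm_opp_put_prop_name(opp_table);    /* put the reference back */
        return 0;
}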
Signed-off-by: Viresh Kumar <viresh.kumar@linaro.org>
Reviewed-by: Stephen Boyd <sboyd@codeaurora.org>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>