OpenCloudOS-Kernel

Commit Graph

Author	SHA1	Message	Date
Rafael J. Wysocki	e5af36b2ad	cpufreq: intel_pstate: Use HWP if enabled by platform firmware It turns out that there are systems where HWP is enabled during initialization by the platform firmware (BIOS), but HWP EPP support is not advertised. After commit `7aa1031223` ("cpufreq: intel_pstate: Avoid enabling HWP if EPP is not supported") intel_pstate refuses to use HWP on those systems, but the fallback PERF_CTL interface does not work on them either because of enabled HWP, and once enabled, HWP cannot be disabled. Consequently, the users of those systems cannot control CPU performance scaling. Address this issue by making intel_pstate use HWP unconditionally if it is enabled already when the driver starts. Fixes: `7aa1031223` ("cpufreq: intel_pstate: Avoid enabling HWP if EPP is not supported") Reported-by: Srinivas Pandruvada <srinivas.pandruvada@linux.intel.com> Tested-by: Srinivas Pandruvada <srinivas.pandruvada@linux.intel.com> Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com> Cc: 5.9+ <stable@vger.kernel.org> # 5.9+	2021-05-10 13:22:39 +02:00
Rafael J. Wysocki	b989bc0f3c	cpufreq: intel_pstate: Simplify intel_pstate_update_perf_limits() Because pstate.max_freq is always equal to the product of pstate.max_pstate and pstate.scaling and, analogously, pstate.turbo_freq is always equal to the product of pstate.turbo_pstate and pstate.scaling, the result of the max_policy_perf computation in intel_pstate_update_perf_limits() is always equal to the quotient of policy_max and pstate.scaling, regardless of whether or not turbo is disabled. Analogously, the result of min_policy_perf in intel_pstate_update_perf_limits() is always equal to the quotient of policy_min and pstate.scaling. Accordingly, intel_pstate_update_perf_limits() need not check whether or not turbo is enabled at all and in order to compute max_policy_perf and min_policy_perf it can always divide policy_max and policy_min, respectively, by pstate.scaling. Make it do so. While at it, move the definition and initialization of the turbo_max local variable to the code branch using it. No intentional functional impact. Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com> Tested-by: Chen Yu <yu.c.chen@intel.com>	2021-04-09 17:16:12 +02:00
Rafael J. Wysocki	de5bcf404a	cpufreq: intel_pstate: Clean up frequency computations Notice that some computations related to frequency in intel_pstate can be simplified if (a) intel_pstate_get_hwp_max() updates the relevant members of struct cpudata by itself and (b) the "turbo disabled" check is moved from it to its callers, so modify the code accordingly and while at it rename intel_pstate_get_hwp_max() to intel_pstate_get_hwp_cap() which better reflects its purpose and provide a simplified variat of it, __intel_pstate_get_hwp_cap(), suitable for the initialization path. No intentional functional impact. Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com> Tested-by: Chen Yu <yu.c.chen@intel.com>	2021-03-23 19:41:14 +01:00
Nigel Christian	75a8d877d6	cpufreq: intel_pstate: Remove repeated word In the comment for trace in passive mode there is an unnecessary "the". Eradicate it. Signed-off-by: Nigel Christian <nigel.l.christian@gmail.com> Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>	2021-01-22 17:04:56 +01:00
Chen Yu	6f67e06008	cpufreq: intel_pstate: Get per-CPU max freq via MSR_HWP_CAPABILITIES if available Currently, when turbo is disabled (either by BIOS or by the user), the intel_pstate driver reads the max non-turbo frequency from the package-wide MSR_PLATFORM_INFO(0xce) register. However, on asymmetric platforms it is possible in theory that small and big core with HWP enabled might have different max non-turbo CPU frequency, because MSR_HWP_CAPABILITIES is per-CPU scope according to Intel Software Developer Manual. The turbo max freq is already per-CPU in current code, so make similar change to the max non-turbo frequency as well. Reported-by: Wendy Wang <wendy.wang@intel.com> Signed-off-by: Chen Yu <yu.c.chen@intel.com> [ rjw: Subject and changelog edits ] Cc: 4.18+ <stable@vger.kernel.org> # 4.18+: a45ee4d4e13b: cpufreq: intel_pstate: Change intel_pstate_get_hwp_max() argument Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>	2021-01-12 19:44:49 +01:00
Rafael J. Wysocki	597ffbc8d0	cpufreq: intel_pstate: Rename two functions Rename intel_cpufreq_adjust_hwp() and intel_cpufreq_adjust_perf_ctl() to intel_cpufreq_hwp_update() and intel_cpufreq_perf_ctl_update(), respectively, to avoid possible confusion with the ->adjist_perf() callback function, intel_cpufreq_adjust_perf(). Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com> Tested-by: Chen Yu <yu.c.chen@intel.com>	2021-01-12 19:34:05 +01:00
Rafael J. Wysocki	a45ee4d4e1	cpufreq: intel_pstate: Change intel_pstate_get_hwp_max() argument All of the callers of intel_pstate_get_hwp_max() access the struct cpudata object that corresponds to the given CPU already and the function itself needs to access that object (in order to update hwp_cap_cached), so modify the code to pass a struct cpudata pointer to it instead of the CPU number. Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com> Tested-by: Chen Yu <yu.c.chen@intel.com>	2021-01-12 19:34:04 +01:00
Rafael J. Wysocki	9dd04ec6bc	cpufreq: intel_pstate: Always read hwp_cap_cached with READ_ONCE() Because intel_pstate_get_hwp_max() which updates hwp_cap_cached may run in parallel with the readers of it, annotate all of the read accesses to it with READ_ONCE(). Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com> Tested-by: Chen Yu <yu.c.chen@intel.com>	2021-01-12 19:34:04 +01:00
Lukas Bulwahn	c4151604f0	cpufreq: intel_pstate: remove obsolete functions percent_fp() was used in intel_pstate_pid_reset(), which was removed in commit `9d0ef7af1f` ("cpufreq: intel_pstate: Do not use PID-based P-state selection") and hence, percent_fp() is unused since then. percent_ext_fp() was last used in intel_pstate_update_perf_limits(), which was refactored in commit `1a4fe38add` ("cpufreq: intel_pstate: Remove max/min fractions to limit performance"), and hence, percent_ext_fp() is unused since then. make CC=clang W=1 points us those unused functions: drivers/cpufreq/intel_pstate.c:79:23: warning: unused function 'percent_fp' [-Wunused-function] static inline int32_t percent_fp(int percent) ^ drivers/cpufreq/intel_pstate.c:94:23: warning: unused function 'percent_ext_fp' [-Wunused-function] static inline int32_t percent_ext_fp(int percent) ^ Remove those obsolete functions. Signed-off-by: Lukas Bulwahn <lukas.bulwahn@gmail.com> Reviewed-by: Nathan Chancellor <natechancellor@gmail.com> Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>	2021-01-07 18:22:46 +01:00
Rafael J. Wysocki	17ffd35809	cpufreq: intel_pstate: Use HWP capabilities in intel_cpufreq_adjust_perf() If turbo P-states cannot be used, either due to the configuration of the processor, or because intel_pstate is not allowed to used them, the maximum available P-state with HWP enabled corresponds to the HWP_CAP.GUARANTEED value which is not static. It can be adjusted by an out-of-band agent or during an Intel Speed Select performance level change, so long as it remains less than or equal to HWP_CAP.MAX. However, if turbo P-states cannot be used, intel_cpufreq_adjust_perf() always uses pstate.max_pstate (set during the initialization of the driver only) as the maximum available P-state, so it may miss a change of the HWP_CAP.GUARANTEED value. Prevent that from happening by modifyig intel_cpufreq_adjust_perf() to always read the "guaranteed" and "maximum turbo" performance levels from the cached HWP_CAP value. Fixes: `a365ab6b9d` ("cpufreq: intel_pstate: Implement the ->adjust_perf() callback") Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com> Acked-by: Srinivas Pandruvada <srinivas.pandruvada@linux.intel.com>	2021-01-07 17:34:32 +01:00
Rafael J. Wysocki	be1283454b	cpufreq: intel_pstate: Fix fast-switch fallback path When sugov_update_single_perf() falls back to the "frequency" path due to the missing scale-invariance, it will call cpufreq_driver_fast_switch() via sugov_fast_switch() and the driver's ->fast_switch() callback will be invoked, so it must not be NULL. However, after commit `a365ab6b9d` ("cpufreq: intel_pstate: Implement the ->adjust_perf() callback") intel_pstate sets ->fast_switch() to NULL when it is going to use intel_cpufreq_adjust_perf(), which is a mistake, because on x86 the scale-invariance may be turned off dynamically, so modify it to retain the original ->adjust_perf() callback pointer. Fixes: `a365ab6b9d` ("cpufreq: intel_pstate: Implement the ->adjust_perf() callback") Reported-by: Kenneth R. Crudup <kenny@panix.com> Tested-by: Kenneth R. Crudup <kenny@panix.com> Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>	2020-12-30 18:22:17 +01:00
Rafael J. Wysocki	e40ad84c26	cpufreq: intel_pstate: Use most recent guaranteed performance values When turbo has been disabled by the BIOS, but HWP_CAP.GUARANTEED is changed later, user space may want to take advantage of this increased guaranteed performance. HWP_CAP.GUARANTEED is not a static value. It can be adjusted by an out-of-band agent or during an Intel Speed Select performance level change. The HWP_CAP.MAX is still the maximum achievable performance with turbo disabled by the BIOS, so HWP_CAP.GUARANTEED can still change as long as it remains less than or equal to HWP_CAP.MAX. When HWP_CAP.GUARANTEED is changed, the sysfs base_frequency attribute shows the most recent guaranteed frequency value. This attribute can be used by user space software to update the scaling min/max limits of the CPU. Currently, the ->setpolicy() callback already uses the latest HWP_CAP values when setting HWP_REQ, but the ->verify() callback will restrict the user settings to the to old guaranteed performance value which prevents user space from making use of the extra CPU capacity theoretically available to it after increasing HWP_CAP.GUARANTEED. To address this, read HWP_CAP in intel_pstate_verify_cpu_policy() to obtain the maximum P-state that can be used and use that to confine the policy max limit instead of using the cached and possibly stale pstate.max_freq value for this purpose. For consistency, update intel_pstate_update_perf_limits() to use the maximum available P-state returned by intel_pstate_get_hwp_max() to compute the maximum frequency instead of using the return value of intel_pstate_get_max_freq() which, again, may be stale. This issue is a side-effect of fixing the scaling frequency limits in commit `eacc9c5a92` ("cpufreq: intel_pstate: Fix intel_pstate_get_hwp_max() for turbo disabled") which corrected the setting of the reduced scaling frequency values, but caused stale HWP_CAP.GUARANTEED to be used in the case at hand. Fixes: `eacc9c5a92` ("cpufreq: intel_pstate: Fix intel_pstate_get_hwp_max() for turbo disabled") Reported-by: Srinivas Pandruvada <srinivas.pandruvada@linux.intel.com> Tested-by: Srinivas Pandruvada <srinivas.pandruvada@linux.intel.com> Cc: 5.8+ <stable@vger.kernel.org> # 5.8+ Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>	2020-12-21 10:51:00 +01:00
Rafael J. Wysocki	a365ab6b9d	cpufreq: intel_pstate: Implement the ->adjust_perf() callback Make intel_pstate expose the ->adjust_perf() callback when it operates in the passive mode with HWP enabled which causes the schedutil governor to use that callback instead of ->fast_switch(). The minimum and target performance-level values passed by the governor to ->adjust_perf() are converted to HWP.REQ.MIN and HWP.REQ.DESIRED, respectively, which allows the processor to adjust its configuration to maximize energy-efficiency while providing sufficient capacity. Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com> Acked-by: Srinivas Pandruvada <srinivas.pandruvada@linux.intel.com> Acked-by: Viresh Kumar <viresh.kumar@linaro.org>	2020-12-15 19:24:19 +01:00
Rafael J. Wysocki	2554c32f0b	cpufreq: intel_pstate: Simplify intel_cpufreq_update_pstate() Avoid doing the same assignment in both branches of a conditional, do it after the whole conditional instead. Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>	2020-12-11 19:53:58 +01:00
Rafael J. Wysocki	fcb3a1ab79	cpufreq: intel_pstate: Take CPUFREQ_GOV_STRICT_TARGET into account Make intel_pstate take the new CPUFREQ_GOV_STRICT_TARGET governor flag into account when it operates in the passive mode with HWP enabled, so as to fix the "powersave" governor behavior in that case (currently, HWP is allowed to scale the performance all the way up to the policy max limit when the "powersave" governor is used, but it should be constrained to the policy min limit then). Fixes: `f6ebbcf08f` ("cpufreq: intel_pstate: Implement passive mode with HWP enabled") Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com> Acked-by: Viresh Kumar <viresh.kumar@linaro.org> Cc: 5.9+ <stable@vger.kernel.org> # 5.9+: `9a2a9ebc0a` cpufreq: Introduce governor flags Cc: 5.9+ <stable@vger.kernel.org> # 5.9+: `218f668701` cpufreq: Introduce CPUFREQ_GOV_STRICT_TARGET Cc: 5.9+ <stable@vger.kernel.org> # 5.9+: `ea9364bbad` cpufreq: Add strict_target to struct cpufreq_policy	2020-11-10 18:36:17 +01:00
Rafael J. Wysocki	e0be38ed4a	cpufreq: intel_pstate: Avoid missing HWP max updates in passive mode If the cpufreq policy max limit is changed when intel_pstate operates in the passive mode with HWP enabled and the "powersave" governor is used on top of it, the HWP max limit is not updated as appropriate. Namely, in the "powersave" governor case, the target P-state is always equal to the policy min limit, so if the latter does not change, intel_cpufreq_adjust_hwp() is not invoked to update the HWP Request MSR due to the "target_pstate != old_pstate" check in intel_cpufreq_update_pstate(), so the HWP max limit is not updated as a result. Also, if the CPUFREQ_NEED_UPDATE_LIMITS flag is not set for the driver and the target frequency does not change along with the policy max limit, the "target_freq == policy->cur" check in __cpufreq_driver_target() prevents the driver's ->target() callback from being invoked at all, so the HWP max limit is not updated. To prevent that occurring, set the CPUFREQ_NEED_UPDATE_LIMITS flag in the intel_cpufreq driver structure if HWP is enabled and modify intel_cpufreq_update_pstate() to do the "target_pstate != old_pstate" check only in the non-HWP case and let intel_cpufreq_adjust_hwp() always run in the HWP case (it will update HWP Request only if the cached value of the register is different from the new one including the limits, so if neither the target P-state value nor the max limit changes, the register write will still be avoided). Fixes: `f6ebbcf08f` ("cpufreq: intel_pstate: Implement passive mode with HWP enabled") Reported-by: Zhang Rui <rui.zhang@intel.com> Cc: 5.9+ <stable@vger.kernel.org> # 5.9+: `1c534352f4` cpufreq: Introduce CPUFREQ_NEED_UPDATE_LIMITS ... Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com> Acked-by: Viresh Kumar <viresh.kumar@linaro.org> Tested-by: Zhang Rui <rui.zhang@intel.com>	2020-10-27 18:53:35 +01:00
Chen Yu	cdc1719cd8	cpufreq: intel_pstate: Delete intel_pstate sysfs if failed to register the driver There is a corner case that if the intel_pstate driver fails to be registered (might be due to invalid MSR access) and acpi_cpufreq takse over, the intel_pstate sysfs interface is still populated which may confuse user space (turbostat for example): grep . /sys/devices/system/cpu/cpu0/cpufreq/scaling_driver acpi-cpufreq grep . /sys/devices/system/cpu/intel_pstate/* /sys/devices/system/cpu/intel_pstate/max_perf_pct:0 /sys/devices/system/cpu/intel_pstate/min_perf_pct:0 grep: /sys/devices/system/cpu/intel_pstate/no_turbo: Resource temporarily unavailable grep: /sys/devices/system/cpu/intel_pstate/num_pstates: Resource temporarily unavailable /sys/devices/system/cpu/intel_pstate/status:off grep: /sys/devices/system/cpu/intel_pstate/turbo_pct: Resource temporarily unavailable The mere presence of the intel_pstate sysfs interface does not mean that intel_pstate is in use (for example, echo "off" to "status"), but it should not be created in the failing case. Fix this issue by deleting the intel_pstate sysfs if the driver registration fails. Reported-by: Wendy Wang <wendy.wang@intel.com> Suggested-by: Zhang Rui <rui.zhang@intel.com> Signed-off-by: Chen Yu <yu.c.chen@intel.com> Acked-by: Srinivas Pandruvada <srinivas.pandruvada@linux.intel.com [ rjw: Refactor code to avoid jumps, change function name, changelog edits ] Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>	2020-10-16 16:33:12 +02:00
Zhang Rui	fc7d17551f	cpufreq: intel_pstate: Fix missing return statement Fix missing return statement when writing "off" to intel_pstate status sysfs I/F. Fixes: `55671ea325` ("cpufreq: intel_pstate: Free memory only when turning off") Signed-off-by: Zhang Rui <rui.zhang@intel.com> Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>	2020-09-30 17:37:23 +02:00
Francisco Jerez	eacc9c5a92	cpufreq: intel_pstate: Fix intel_pstate_get_hwp_max() for turbo disabled This fixes the behavior of the scaling_max_freq and scaling_min_freq sysfs files in systems which had turbo disabled by the BIOS. Caleb noticed that the HWP is programmed to operate in the wrong P-state range on his system when the CPUFREQ policy min/max frequency is set via sysfs. This seems to be because in his system intel_pstate_get_hwp_max() is returning the maximum turbo P-state even though turbo was disabled by the BIOS, which causes intel_pstate to scale kHz frequencies incorrectly e.g. setting the maximum turbo frequency whenever the maximum guaranteed frequency is requested via sysfs. Tested-by: Caleb Callaway <caleb.callaway@intel.com> Signed-off-by: Francisco Jerez <currojerez@riseup.net> Acked-by: Srinivas Pandruvada <srinivas.pandruvada@linux.intel.com> [ rjw: Minor subject edits ] Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>	2020-09-01 21:15:00 +02:00
Rafael J. Wysocki	55671ea325	cpufreq: intel_pstate: Free memory only when turning off When intel_pstate switches the operation mode from "active" to "passive" or the other way around, freeing its data structures representing CPUs and allocating them again from scratch is not necessary and wasteful. Moreover, if these data structures are preserved, the cached HWP Request MSR value from there may be written to the MSR to start with to reinitialize it and help to restore the EPP value set previously (it is set to 0xFF when CPUs go offline to allow their SMT siblings to use the full range of EPP values and that also happens when the driver gets unregistered). Accordingly, modify the driver to only do a full cleanup on driver object registration errors and when its status is changed to "off" via sysfs and to write the cached HWP Request MSR value back to the MSR on CPU init if the data structure representing the given CPU is still there. Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com> Acked-by: Srinivas Pandruvada <srinivas.pandruvada@linux.intel.com>	2020-09-01 21:14:52 +02:00
Rafael J. Wysocki	4adcf2e582	cpufreq: intel_pstate: Add ->offline and ->online callbacks Add ->offline and ->online driver callbacks to prepare for taking a CPU offline and to restore its working configuration when it goes back online, respectively, to avoid invoking the ->init callback on every CPU online which is quite a bit of unnecessary overhead. Define ->offline and ->online so that they can be used in the passive mode as well as in the active mode and because ->offline will do the majority of ->stop_cpu work, the passive mode does not need that callback any more, so drop it from there. Also modify the active mode ->suspend and ->resume callbacks to prevent them from interfering with the new ->offline and ->online ones in case the latter are invoked withing the system-wide suspend and resume code flow and make the passive mode use them too. Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com> Acked-by: Srinivas Pandruvada <srinivas.pandruvada@linux.intel.com>	2020-09-01 21:14:45 +02:00
Rafael J. Wysocki	b388eb58ce	cpufreq: intel_pstate: Tweak the EPP sysfs interface Modify the EPP sysfs interface to reject attempts to change the EPP to values different from 0 ("performance") in the active mode with the "performance" policy (ie. scaling_governor set to "performance"), to avoid situations in which the kernel appears to discard data passed to it via the EPP sysfs attribute. Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com> Reviewed-by: Artem Bityutskiy <artem.bityutskiy@linux.intel.com> Acked-by: Srinivas Pandruvada <srinivas.pandruvada@linux.intel.com>	2020-09-01 21:14:06 +02:00
Rafael J. Wysocki	c27a0ccc3c	cpufreq: intel_pstate: Update cached EPP in the active mode Make intel_pstate update the cached EPP value when setting the EPP via sysfs in the active mode just like it is the case in the passive mode, for consistency, but also for the benefit of subsequent changes. No intentional functional impact. Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com> Acked-by: Srinivas Pandruvada <srinivas.pandruvada@linux.intel.com>	2020-09-01 21:13:59 +02:00
Rafael J. Wysocki	43298db300	cpufreq: intel_pstate: Refuse to turn off with HWP enabled After commit `f6ebbcf08f` ("cpufreq: intel_pstate: Implement passive mode with HWP enabled") it is possible to change the driver status to "off" via sysfs with HWP enabled, which effectively causes the driver to unregister itself, but HWP remains active and it forces the minimum performance, so even if another cpufreq driver is loaded, it will not be able to control the CPU frequency. For this reason, make the driver refuse to change the status to "off" with HWP enabled. Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com> Acked-by: Srinivas Pandruvada <srinivas.pandruvada@linux.intel.com>	2020-09-01 21:13:32 +02:00
Rafael J. Wysocki	f6ebbcf08f	cpufreq: intel_pstate: Implement passive mode with HWP enabled Allow intel_pstate to work in the passive mode with HWP enabled and make it set the HWP minimum performance limit (HWP floor) to the P-state value given by the target frequency supplied by the cpufreq governor, so as to prevent the HWP algorithm and the CPU scheduler from working against each other, at least when the schedutil governor is in use, and update the intel_pstate documentation accordingly. Among other things, this allows utilization clamps to be taken into account, at least to a certain extent, when intel_pstate is in use and makes it more likely that sufficient capacity for deadline tasks will be provided. After this change, the resulting behavior of an HWP system with intel_pstate in the passive mode should be close to the behavior of the analogous non-HWP system with intel_pstate in the passive mode, except that the HWP algorithm is generally allowed to make the CPU run at a frequency above the floor P-state set by intel_pstate in the entire available range of P-states, while without HWP a CPU can run in a P-state above the requested one if the latter falls into the range of turbo P-states (referred to as the turbo range) or if the P-states of all CPUs in one package are coordinated with each other at the hardware level. [Note that in principle the HWP floor may not be taken into account by the processor if it falls into the turbo range, in which case the processor has a license to choose any P-state, either below or above the HWP floor, just like a non-HWP processor in the case when the target P-state falls into the turbo range.] With this change applied, intel_pstate in the passive mode assumes complete control over the HWP request MSR and concurrent changes of that MSR (eg. via the direct MSR access interface) are overridden by it. Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com> Acked-by: Srinivas Pandruvada <srinivas.pandruvada@linux.intel.com> Reviewed-by: Francisco Jerez <currojerez@riseup.net>	2020-08-11 17:29:45 +02:00
Srinivas Pandruvada	4daca379c7	cpufreq: intel_pstate: Fix cpuinfo_max_freq when MSR_TURBO_RATIO_LIMIT is 0 The MSR_TURBO_RATIO_LIMIT can be 0. This is not an error. User can update this MSR via BIOS settings on some systems or can use msr tools to update. Also some systems boot with value = 0. This results in display of cpufreq/cpuinfo_max_freq wrong. This value will be equal to cpufreq/base_frequency, even though turbo is enabled. But platform will still function normally in HWP mode as we get max 1-core frequency from the MSR_HWP_CAPABILITIES. This MSR is already used to calculate cpu->pstate.turbo_freq, which is used for to set policy->cpuinfo.max_freq. But some other places cpu->pstate.turbo_pstate is used. For example to set policy->max. To fix this, also update cpu->pstate.turbo_pstate when updating cpu->pstate.turbo_freq. Signed-off-by: Srinivas Pandruvada <srinivas.pandruvada@linux.intel.com> Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>	2020-08-04 12:43:07 +02:00
Rafael J. Wysocki	de002c55ca	cpufreq: intel_pstate: Fix EPP setting via sysfs in active mode Because intel_pstate_set_energy_pref_index() reads and writes the MSR_HWP_REQUEST register without using the cached value of it used by intel_pstate_hwp_boost_up() and intel_pstate_hwp_boost_down(), those functions may overwrite the value written by it and so the EPP value set via sysfs may be lost. To avoid that, make intel_pstate_set_energy_pref_index() take the cached value of MSR_HWP_REQUEST just like the other two routines mentioned above and update it with the new EPP value coming from user space in addition to updating the MSR. Note that the MSR itself still needs to be updated too in case hwp_boost is unset or the boosting mechanism is not active at the EPP change time. Fixes: `e0efd5be63` ("cpufreq: intel_pstate: Add HWP boost utility and sched util hooks") Reported-by: Francisco Jerez <currojerez@riseup.net> Cc: 4.18+ <stable@vger.kernel.org> # 4.18+: 3da97d4db8ee cpufreq: intel_pstate: Rearrange ... Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com> Reviewed-by: Francisco Jerez <currojerez@riseup.net>	2020-07-30 18:20:23 +02:00
Rafael J. Wysocki	3a95717606	cpufreq: intel_pstate: Rearrange the storing of new EPP values Move the locking away from intel_pstate_set_energy_pref_index() into its only caller and drop the (now redundant) return_pref label from it. Also move the "raw" EPP value check into the caller of that function, so as to do it before acquiring the mutex, and reduce code duplication related to the "raw" EPP values processing somewhat. No intentional functional impact. Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com> Reviewed-by: Francisco Jerez <currojerez@riseup.net>	2020-07-30 18:19:52 +02:00
Rafael J. Wysocki	80e3036866	Merge back cpufreq material for v5.9.	2020-07-27 12:34:55 +02:00
Rafael J. Wysocki	7aa1031223	cpufreq: intel_pstate: Avoid enabling HWP if EPP is not supported Although there are processors supporting hardware-managed P-states (HWP) without the energy-performance preference (EPP) feature, they are not expected to be run with HWP enabled (the BIOS should disable HWP on those systems). Missing EPP support generally indicates an incomplete HWP implementation and so it is better to avoid using HWP on those systems in production. However, intel_pstate currently enables HWP on such systems, which is questionable, so prevent it from doing that by making it check EPP support before enabling HWP and avoid enabling it if EPP is not supported by the processor at hand. Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>	2020-07-16 17:16:51 +02:00
Rafael J. Wysocki	23a522e388	cpufreq: intel_pstate: Clean up aperf_mperf_shift description The kerneldoc description of the aperf_mperf_shift field in struct global_params is unclear and there is a typo in it, so simplify it and clean it up. Reported-by: Lee Jones <lee.jones@linaro.org> Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com> Reviewed-by: Lee Jones <lee.jones@linaro.org>	2020-07-16 17:15:13 +02:00
Lee Jones	8f23d1f12c	cpufreq: intel_pstate: Supply struct attribute description for get_aperf_mperf_shift() Fixes the following W=1 kernel build warning(s): drivers/cpufreq/intel_pstate.c:293: warning: Function parameter or member 'get_aperf_mperf_shift' not described in 'pstate_funcs' Suggested-by: "Rafael J. Wysocki" <rafael@kernel.org> Signed-off-by: Lee Jones <lee.jones@linaro.org> Acked-by: Viresh Kumar <viresh.kumar@linaro.org> [ rjw: Remove line break ] Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>	2020-07-15 15:17:06 +02:00
Rafael J. Wysocki	39a188b883	cpufreq: intel_pstate: Fix active mode setting from command line If intel_pstate starts in the passive mode by default (that happens when the processor in the system doesn't support HWP), passing intel_pstate=active in the kernel command line doesn't work, so fix that. Fixes: `33aa46f252` ("cpufreq: intel_pstate: Use passive mode by default without HWP") Reported-by: Doug Smythies <dsmythies@telus.net> Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com> Acked-by: Doug Smythies <dsmythies@telus.net>	2020-07-13 17:55:57 +02:00
Srinivas Pandruvada	3ff79754d7	cpufreq: intel_pstate: Fix static checker warning for epp variable Fix warning for: drivers/cpufreq/intel_pstate.c:731 store_energy_performance_preference() error: uninitialized symbol 'epp'. This warning is for a case, when energy_performance_preference attribute matches pre defined strings. In this case the value of raw epp will not be used to set EPP bits in MSR_HWP_REQUEST. So initializing with any value is fine. Reported-by: Dan Carpenter <dan.carpenter@oracle.com> Signed-off-by: Srinivas Pandruvada <srinivas.pandruvada@linux.intel.com> Acked-by: Viresh Kumar <viresh.kumar@linaro.org> Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>	2020-07-13 15:50:53 +02:00
Srinivas Pandruvada	f473bf398b	cpufreq: intel_pstate: Allow raw energy performance preference value Currently using attribute "energy_performance_preference", user space can write one of the four per-defined preference string. These preference strings gets mapped to a hard-coded Energy-Performance Preference (EPP) or Energy-Performance Bias (EPB) knob. These four values are supposed to cover broad spectrum of use cases, but are not uniformly distributed in the range. There are number of cases, where this is not enough. For example: Suppose user wants more performance when connected to AC. Instead of using default "balance performance", the "performance" setting can be used. This changes EPP value from 0x80 to 0x00. But setting EPP to 0, results in electrical and thermal issues on some platforms. This results in aggressive throttling, which causes a drop in performance. But some value between 0x80 and 0x00 results in better performance. But that value can't be fixed as the power curve is not linear. In some cases just changing EPP from 0x80 to 0x75 is enough to get significant performance gain. Similarly on battery the default "balance_performance" mode can be aggressive in power consumption. But picking up the next choice "balance power" results in too much loss of performance, which results in bad user experience in use cases like "Google Hangout". It was observed that some value between these two EPP is optimal. This change allows fine grain EPP tuning for platform like Chromebook or for users who wants to fine tune power and performance. Here based on the product and use cases, different EPP values can be set. This change is similar to the change done for: /sys/devices/system/cpu/cpu*/power/energy_perf_bias where user has choice to write a predefined string or raw value. The change itself is trivial. When user preference doesn't match predefined string preferences and value is an unsigned integer and in range, use that value for EPP. When the EPP feature is not present writing raw value is not supported. Suggested-by: Len Brown <lenb@kernel.org> Signed-off-by: Srinivas Pandruvada <srinivas.pandruvada@linux.intel.com> Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>	2020-07-02 13:02:46 +02:00
Srinivas Pandruvada	ed7bde7a6d	cpufreq: intel_pstate: Allow enable/disable energy efficiency By default intel_pstate the driver disables energy efficiency by setting MSR_IA32_POWER_CTL bit 19 for Kaby Lake desktop CPU model in HWP mode. This CPU model is also shared by Coffee Lake desktop CPUs. This allows these systems to reach maximum possible frequency. But this adds power penalty, which some customers don't want. They want some way to enable/ disable dynamically. So, add an additional attribute "energy_efficiency" under /sys/devices/system/cpu/intel_pstate/ for these CPU models. This allows to read and write bit 19 ("Disable Energy Efficiency Optimization") in the MSR IA32_POWER_CTL. This attribute is present in both HWP and non-HWP mode as this has an effect in both modes. Refer to Intel Software Developer's manual for details. The scope of this bit is package wide. Also these systems are single package systems. So read/write MSR on the current CPU is enough. The energy efficiency (EE) bit setting needs to be preserved during suspend/resume and CPU offline/online operation. To do this: - Restoring the EE setting from the cpufreq resume() callback, if there is change from the system default. - By default, don't disable EE from cpufreq init() callback for matching CPU models. Since the scope is package wide and is a single package system, move the disable EE calls from init() callback to intel_pstate_init() function, which is called only once. Suggested-by: Len Brown <lenb@kernel.org> Signed-off-by: Srinivas Pandruvada <srinivas.pandruvada@linux.intel.com> Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>	2020-07-02 13:02:46 +02:00
Srinivas Pandruvada	589bab6bb3	cpufreq: intel_pstate: Add one more OOB control bit Add one more bit for OOB (Out Of Band) enabling of P-states. If OOB handling of P-states is enabled, intel_pstate shouldn't load. Currently, only "BIT(8) == 1" of the MSR MSR_MISC_PWR_MGMT is considered as OOB, but "BIT(18) == 1" needs to be taken into consideration as OOB condition too. Signed-off-by: Srinivas Pandruvada <srinivas.pandruvada@linux.intel.com> [ rjw: Add an empty code line, edit subject and changelog ] Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>	2020-06-23 17:24:32 +02:00
Rafael J. Wysocki	9795a0ddf8	Merge back cpufreq material for v5.8.	2020-05-02 22:00:56 +02:00
Chris Wilson	8c539776ac	cpufreq: intel_pstate: Only mention the BIOS disabling turbo mode once Make a note of the first time we discover the turbo mode has been disabled by the BIOS, as otherwise we complain every time we try to update the mode. Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk> Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>	2020-04-27 10:30:11 +02:00
Rafael J. Wysocki	33aa46f252	cpufreq: intel_pstate: Use passive mode by default without HWP After recent changes allowing scale-invariant utilization to be used on x86, the schedutil governor on top of intel_pstate in the passive mode should be on par with (or better than) the active mode "powersave" algorithm of intel_pstate on systems in which hardware-managed P-states (HWP) are not used, so it should not be necessary to use the internal scaling algorithm in those cases. Accordingly, modify intel_pstate to start in the passive mode by default if the processor at hand does not support HWP of if the driver is requested to avoid using HWP through the kernel command line. Among other things, that will allow utilization clamps and the support for RT/DL tasks in the schedutil governor to be utilized on systems in which intel_pstate is used. Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>	2020-04-17 17:04:38 +02:00
Linus Torvalds	642e53ead6	Merge branch 'sched-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull scheduler updates from Ingo Molnar: "The main changes in this cycle are: - Various NUMA scheduling updates: harmonize the load-balancer and NUMA placement logic to not work against each other. The intended result is better locality, better utilization and fewer migrations. - Introduce Thermal Pressure tracking and optimizations, to improve task placement on thermally overloaded systems. - Implement frequency invariant scheduler accounting on (some) x86 CPUs. This is done by observing and sampling the 'recent' CPU frequency average at ~tick boundaries. The CPU provides this data via the APERF/MPERF MSRs. This hopefully makes our capacity estimates more precise and keeps tasks on the same CPU better even if it might seem overloaded at a lower momentary frequency. (As usual, turbo mode is a complication that we resolve by observing the maximum frequency and renormalizing to it.) - Add asymmetric CPU capacity wakeup scan to improve capacity utilization on asymmetric topologies. (big.LITTLE systems) - PSI fixes and optimizations. - RT scheduling capacity awareness fixes & improvements. - Optimize the CONFIG_RT_GROUP_SCHED constraints code. - Misc fixes, cleanups and optimizations - see the changelog for details" * 'sched-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (62 commits) threads: Update PID limit comment according to futex UAPI change sched/fair: Fix condition of avg_load calculation sched/rt: cpupri_find: Trigger a full search as fallback kthread: Do not preempt current task if it is going to call schedule() sched/fair: Improve spreading of utilization sched: Avoid scale real weight down to zero psi: Move PF_MEMSTALL out of task->flags MAINTAINERS: Add maintenance information for psi psi: Optimize switching tasks inside shared cgroups psi: Fix cpu.pressure for cpu.max and competing cgroups sched/core: Distribute tasks within affinity masks sched/fair: Fix enqueue_task_fair warning thermal/cpu-cooling, sched/core: Move the arch_set_thermal_pressure() API to generic scheduler code sched/rt: Remove unnecessary push for unfit tasks sched/rt: Allow pulling unfitting task sched/rt: Optimize cpupri_find() on non-heterogenous systems sched/rt: Re-instate old behavior in select_task_rq_rt() sched/rt: cpupri_find: Implement fallback mechanism for !fit case sched/fair: Fix reordering of enqueue/dequeue_task_fair() sched/fair: Fix runnable_avg for throttled cfs ...	2020-03-30 17:01:51 -07:00
Linus Torvalds	9b82f05f86	Merge branch 'perf-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull perf updates from Ingo Molnar: "The main changes in this cycle were: Kernel side changes: - A couple of x86/cpu cleanups and changes were grandfathered in due to patch dependencies. These clean up the set of CPU model/family matching macros with a consistent namespace and C99 initializer style. - A bunch of updates to various low level PMU drivers: * AMD Family 19h L3 uncore PMU * Intel Tiger Lake uncore support * misc fixes to LBR TOS sampling - optprobe fixes - perf/cgroup: optimize cgroup event sched-in processing - misc cleanups and fixes Tooling side changes are to: - perf {annotate,expr,record,report,stat,test} - perl scripting - libapi, libperf and libtraceevent - vendor events on Intel and S390, ARM cs-etm - Intel PT updates - Documentation changes and updates to core facilities - misc cleanups, fixes and other enhancements" * 'perf-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (89 commits) cpufreq/intel_pstate: Fix wrong macro conversion x86/cpu: Cleanup the now unused CPU match macros hwrng: via_rng: Convert to new X86 CPU match macros crypto: Convert to new CPU match macros ASoC: Intel: Convert to new X86 CPU match macros powercap/intel_rapl: Convert to new X86 CPU match macros PCI: intel-mid: Convert to new X86 CPU match macros mmc: sdhci-acpi: Convert to new X86 CPU match macros intel_idle: Convert to new X86 CPU match macros extcon: axp288: Convert to new X86 CPU match macros thermal: Convert to new X86 CPU match macros hwmon: Convert to new X86 CPU match macros platform/x86: Convert to new CPU match macros EDAC: Convert to new X86 CPU match macros cpufreq: Convert to new X86 CPU match macros ACPI: Convert to new X86 CPU match macros x86/platform: Convert to new CPU match macros x86/kernel: Convert to new CPU match macros x86/kvm: Convert to new CPU match macros x86/perf/events: Convert to new CPU match macros ...	2020-03-30 16:40:08 -07:00
Rafael J. Wysocki	5ac54113dd	cpufreq: intel_pstate: Simplify intel_pstate_cpu_init() The initial policy value set by intel_pstate_cpu_init() depends on whether or not CONFIG_CPU_FREQ_DEFAULT_GOV_PERFORMANCE is set, but that is not necessary, because the core will set the policy to "performance" in cpufreq_init_policy() if the default governor is "performance" anyway. Accordingly, change intel_pstate_cpu_init() to always set policy to CPUFREQ_POLICY_POWERSAVE initially to provide a valid fallback value to cpufreq_init_policy() in case the default cpufreq governor is neither "powersave" nor "performance". Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>	2020-03-26 20:30:11 +01:00
Thomas Gleixner	d97828072d	cpufreq/intel_pstate: Fix wrong macro conversion The feature flag hwp_support_ids are supposed to match on is X86_FEATURE_HWP, not X86_FEATURE_APERFMPERF. Fix it. [ bp: Write commit message. ] Fixes: `b11d77fa30` ("cpufreq: Convert to new X86 CPU match macros") Reported-by: kernel test robot <rong.a.chen@intel.com> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Borislav Petkov <bp@suse.de> Link: https://lkml.kernel.org/r/20200324060124.GC11705@shao2-debian	2020-03-25 13:21:55 +01:00
Thomas Gleixner	b11d77fa30	cpufreq: Convert to new X86 CPU match macros The new macro set has a consistent namespace and uses C99 initializers instead of the grufty C89 ones. Get rid the of most local macro wrappers for consistency. The ones which make sense for readability are renamed to X86_MATCH*. In the centrino driver this also removes the two extra duplicates of family 6 model 13 which have no value at all. Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Borislav Petkov <bp@suse.de> Reviewed-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Acked-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com> Link: https://lkml.kernel.org/r/87eetheu88.fsf@nanos.tec.linutronix.de	2020-03-24 21:31:27 +01:00
Rafael J. Wysocki	d5a2a6bb27	cpufreq: intel_pstate: Consolidate policy verification There is still some code duplication between intel_pstate_verify_policy() and intel_cpufreq_verify_policy(), so avoid it by moving the common code into a separate function and calling it from both these places. No intentional functional impact. Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>	2020-03-14 11:44:56 +01:00
Ingo Molnar	546121b65f	Linux 5.6-rc3 -----BEGIN PGP SIGNATURE----- iQFSBAABCAA8FiEEq68RxlopcLEwq+PEeb4+QwBBGIYFAl5TFjYeHHRvcnZhbGRz QGxpbnV4LWZvdW5kYXRpb24ub3JnAAoJEHm+PkMAQRiGikYIAIhI4C8R87wyj/0m b2NWk6TZ5AFmiZLYSbsPYxdSC9OLdUmlGFKgL2SyLTwZCiHChm+cNBrngp3hJ6gz x1YH99HdjzkiaLa0hCc2+a/aOt8azGU2RiWEP8rbo0gFSk28wE6FjtzSxR95jyPz FRKo/sM+dHBMFXrthJbr+xHZ1De28MITzS2ddstr/10ojoRgm43I3qo1JKhjoDN5 9GGb6v0Md5eo+XZjjB50CvgF5GhpiqW7+HBB7npMsgTk37GdsR5RlosJ/TScLVC9 dNeanuqk8bqMGM0u2DFYdDqjcqAlYbt8aobuWWCB5xgPBXr5G2nox+IgF/f9G6UH EShA/xs= =OFPc -----END PGP SIGNATURE----- Merge tag 'v5.6-rc3' into sched/core, to pick up fixes and dependent patches Signed-off-by: Ingo Molnar <mingo@kernel.org>	2020-02-24 11:36:09 +01:00
Giovanni Gherdovich	918229cdd5	x86/intel_pstate: Handle runtime turbo disablement/enablement in frequency invariance On some platforms such as the Dell XPS 13 laptop the firmware disables turbo when the machine is disconnected from AC, and viceversa it enables it again when it's reconnected. In these cases a _PPC ACPI notification is issued. The scheduler needs to know freq_max for frequency-invariant calculations. To account for turbo availability to come and go, record freq_max at boot as if turbo was available and store it in a helper variable. Use a setter function to swap between freq_base and freq_max every time turbo goes off or on. Signed-off-by: Giovanni Gherdovich <ggherdovich@suse.cz> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Signed-off-by: Ingo Molnar <mingo@kernel.org> Acked-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com> Link: https://lkml.kernel.org/r/20200122151617.531-7-ggherdovich@suse.cz	2020-01-28 21:37:06 +01:00
Rafael J. Wysocki	1e4f63aecb	cpufreq: Avoid creating excessively large stack frames In the process of modifying a cpufreq policy, the cpufreq core makes a copy of it including all of the internals which is stored on the CPU stack. Because struct cpufreq_policy is relatively large, this may cause the size of the stack frame to exceed the 2 KB limit and so the GCC complains when -Wframe-larger-than= is used. In fact, it is not necessary to copy the entire policy structure in order to modify it, however. First, because cpufreq_set_policy() obtains the min and max policy limits from frequency QoS now, it is not necessary to pass the limits to it from the callers. The only things that need to be passed to it from there are the new governor pointer or (if there is a built-in governor in the driver) the "policy" value representing the governor choice. They both can be passed as individual arguments, though, so make cpufreq_set_policy() take them this way and rework its callers accordingly. This avoids making copies of cpufreq policies in the callers of cpufreq_set_policy(). Second, cpufreq_set_policy() still needs to pass the new policy data to the ->verify() callback of the cpufreq driver whose task is to sanitize the min and max policy limits. It still does not need to make a full copy of struct cpufreq_policy for this purpose, but it needs to pass a few items from it to the driver in case they are needed (different drivers have different needs in that respect and all of them have to be covered). For this reason, introduce struct cpufreq_policy_data to hold copies of the members of struct cpufreq_policy used by the existing ->verify() driver callbacks and pass a pointer to a temporary structure of that type to ->verify() (instead of passing a pointer to full struct cpufreq_policy to it). While at it, notice that intel_pstate and longrun don't really need to verify the "policy" value in struct cpufreq_policy, so drop those check from them to avoid copying "policy" into struct cpufreq_policy_data (which allows it to be slightly smaller). Also while at it fix up white space in a couple of places and make cpufreq_set_policy() static (as it can be so). Fixes: `3000ce3c52` ("cpufreq: Use per-policy frequency QoS") Link: https://lore.kernel.org/linux-pm/CAMuHMdX6-jb1W8uC2_237m8ctCpsnGp=JCxqt8pCWVqNXHmkVg@mail.gmail.com Reported-by: kbuild test robot <lkp@intel.com> Reported-by: Geert Uytterhoeven <geert@linux-m68k.org> Cc: 5.4+ <stable@vger.kernel.org> # 5.4+ Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com> Acked-by: Viresh Kumar <viresh.kumar@linaro.org>	2020-01-27 10:33:33 +01:00
Harry Pan	731e6b9753	cpufreq: intel_pstate: fix spelling mistake: "Whethet" -> "Whether" Fix a spelling typo in the comment, no function change. Signed-off-by: Harry Pan <harry.pan@intel.com> Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>	2020-01-13 11:45:20 +01:00

1 2 3 4 5 ...

417 Commits