OpenCloudOS-Kernel

Commit Graph

Author	SHA1	Message	Date
Linus Torvalds	6ef746769e	More power management updates for 4.20-rc1 - Fix build regression in the intel_pstate driver that doesn't build without CONFIG_ACPI after recent changes (Dominik Brodowski). - One of the heuristics in the menu cpuidle governor is based on a function returning 0 most of the time, so drop it and clean up the scheduler code related to it (Daniel Lezcano). - Prevent the arm_big_little cpufreq driver from being used on ARM64 which is not suitable for it and drop the arm_big_little_dt driver that is not used any more (Sudeep Holla). - Prevent the hung task watchdog from triggering during resume from system-wide sleep states by disabling it before freezing tasks and enabling it again after they have been thawed (Vitaly Kuznetsov). -----BEGIN PGP SIGNATURE----- Version: GnuPG v2 iQIcBAABCAAGBQJb2BJ7AAoJEILEb/54YlRx/kwP/iD7tUUZ6mT84OI0FTbEj8A/ fM+uHrwy25PmqyWGGtbHpaWU9OxVxUReSicsBCt+2LZmX3sFYpbSb243mv3pmxqb A0kLflG4lWCKJNIfa/a3OMDTUw26mxSTCidE3jJXkd8HkWrzeAWvMair+UCuzMf3 A4Omu0IkNL8C0MKtUOb3PlUk3dnLYMxuairNhozBPhi+P+0tLW9/9XvgPJBVhnbZ CKn/aFsDoc08tAfxC8N32cgKwE7nbeIgTJTBFyu2lQmInsd4TTuoM50vSC5i+x88 AmBOoH9IX0fhXJ6hgm+VMW8+x9S+H7jAVy/3C2xoUBeCclzlxX6eUCtjV5YNZqqn 1nXQfGeAwgzX6Tyu6HjM7vjbfObk59ZwpmDRPJEUEhLDEBMS+iDStlp9zmKTedNm G4iSTzS6qJCNPtx4y5wkLp/FvzTofIuWqVFJSJC4+EoVKkbbw9xwaY+JKXUt1Uwx j+U6EtRhzL/kVX0nq+iQXXeANxCFNzI56Ov5O7mxjF1m/hDE/Gb2QEeIb6nRZC2A H3I2so2J3h1yTgadpGFFvJWaqfHkgcBTsm06tSgHVb86quiTANJIQ9mqfFyOzDDJ KaZ82MROt7UuCMI6X9n+oIBDZWLHmADge6RdHCD1wB+zrUmusCtNEHUZACXd0mPf s8MUK4bWVhViVXGS5bMP =/bnR -----END PGP SIGNATURE----- Merge tag 'pm-4.20-rc1-2' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm Pull more power management updates from Rafael Wysocki: "These remove a questionable heuristic from the menu cpuidle governor, fix a recent build regression in the intel_pstate driver, clean up ARM big-Little support in cpufreq and fix up hung task watchdog's interaction with system-wide power management transitions. Specifics: - Fix build regression in the intel_pstate driver that doesn't build without CONFIG_ACPI after recent changes (Dominik Brodowski). - One of the heuristics in the menu cpuidle governor is based on a function returning 0 most of the time, so drop it and clean up the scheduler code related to it (Daniel Lezcano). - Prevent the arm_big_little cpufreq driver from being used on ARM64 which is not suitable for it and drop the arm_big_little_dt driver that is not used any more (Sudeep Holla). - Prevent the hung task watchdog from triggering during resume from system-wide sleep states by disabling it before freezing tasks and enabling it again after they have been thawed (Vitaly Kuznetsov)" * tag 'pm-4.20-rc1-2' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm: kernel: hung_task.c: disable on suspend cpufreq: remove unused arm_big_little_dt driver cpufreq: drop ARM_BIG_LITTLE_CPUFREQ support for ARM64 cpufreq: intel_pstate: Fix compilation for !CONFIG_ACPI cpuidle: menu: Remove get_loadavg() from the performance multiplier sched: Factor out nr_iowait and nr_iowait_cpu	2018-10-30 09:08:07 -07:00
Johannes Weiner	8508cf3ffa	sched: loadavg: consolidate LOAD_INT, LOAD_FRAC, CALC_LOAD There are several definitions of those functions/macros in places that mess with fixed-point load averages. Provide an official version. [akpm@linux-foundation.org: fix missed conversion in block/blk-iolatency.c] Link: http://lkml.kernel.org/r/20180828172258.3185-5-hannes@cmpxchg.org Signed-off-by: Johannes Weiner <hannes@cmpxchg.org> Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org> Tested-by: Suren Baghdasaryan <surenb@google.com> Tested-by: Daniel Drake <drake@endlessm.com> Cc: Christopher Lameter <cl@linux.com> Cc: Ingo Molnar <mingo@redhat.com> Cc: Johannes Weiner <jweiner@fb.com> Cc: Mike Galbraith <efault@gmx.de> Cc: Peter Enderborg <peter.enderborg@sony.com> Cc: Randy Dunlap <rdunlap@infradead.org> Cc: Shakeel Butt <shakeelb@google.com> Cc: Tejun Heo <tj@kernel.org> Cc: Vinayak Menon <vinmenon@codeaurora.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2018-10-26 16:26:32 -07:00
Daniel Lezcano	a7fe5190c0	cpuidle: menu: Remove get_loadavg() from the performance multiplier The function get_loadavg() returns almost always zero. To be more precise, statistically speaking for a total of 1023379 times passing in the function, the load is equal to zero 1020728 times, greater than 100, 610 times, the remaining is between 0 and 5. In 2011, the get_loadavg() was removed from the Android tree because of the above [1]. At this time, the load was: unsigned long this_cpu_load(void) { struct rq this = this_rq(); return this->cpu_load[0]; } In 2014, the code was changed by commit `372ba8cb46` (cpuidle: menu: Lookup CPU runqueues less) and the load is: void get_iowait_load(unsigned long nr_waiters, unsigned long load) { struct rq rq = this_rq(); nr_waiters = atomic_read(&rq->nr_iowait); load = rq->load.weight; } with the same result. Both measurements show using the load in this code path does no matter anymore. Removing it. [1] `4dedd9f124` Signed-off-by: Daniel Lezcano <daniel.lezcano@linaro.org> Acked-by: Mel Gorman <mgorman@suse.de> Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org> Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>	2018-10-25 16:49:27 +02:00
Rafael J. Wysocki	f1c8e410cd	cpuidle: menu: Avoid computations when result will be discarded If the minimum interval taken into account in the average computation loop in get_typical_interval() is less than the expected idle duration determined so far, the resultant average cannot be greater than that value as well and the entire return result of the function is going to be discarded anyway going forward. In that case, it is a waste of time to carry out the remaining computations in get_typical_interval(), so avoid that by returning early if the minimum interval is not below the expected idle duration. No intentional changes of behavior. Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>	2018-10-18 09:34:13 +02:00
Rafael J. Wysocki	12b65eadf0	cpuidle: menu: Drop redundant comparison Since the correction factor cannot be greater than RESOLUTION * DECAY, the result of the predicted_us computation in menu_select() cannot be greater than data->next_timer_us, so it is not necessary to compare the "typical interval" value coming from get_typical_interval() with data->next_timer_us separately. It is sufficient to copmare predicted_us with the return value of get_typical_interval() directly, so do that and drop the now redundant expected_interval variable. No intentional changes of behavior. Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>	2018-10-18 09:34:13 +02:00
Rafael J. Wysocki	bde091ece2	cpuidle: menu: Simplify checks related to the polling state After some recent menu governor changes, the promotion of the "polling" state to a physical one is mostly controlled by the latency limit (resulting from the "interactivity" factor) and not by the time to the closest timer event, so it should be sufficient to check the exit latency of that state for this purpose (of course, its target residency still needs to be within the next timer event range for energy-efficiency). Also, the physical state the "polling" one is promoted to need not be the next one in principle (in case the next state is disabled, for example). For these reasons, simplify the checks made to decide whether or not to promote the "polling" state to a physical one and update the target idle duration when it is promoted in case the residency of the new state turns out to be above the tick boundary (in which case there is no reason to stop the tick). Tested-by: Doug Smythies <dsmythies@telus.net> Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>	2018-10-12 10:46:37 +02:00
Rafael J. Wysocki	53812cdc91	cpuidle: menu: Move the latency_req == 0 special case check It is better to always update data->bucket before returning from menu_select() to avoid updating the correction factor for a stale bucket, so combine the latency_req == 0 special check with the more general check below. Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com> Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>	2018-10-04 19:27:27 +02:00
Rafael J. Wysocki	8b007ebec9	cpuidle: menu: Avoid computations for very close timers If the next timer event (not including the tick) is closer than the target residency of the second state or the PM QoS latency constraint is below its exit latency, state[0] will be used regardless of any other factors, so skip the computations in menu_select() then and return 0 straight away from it. Still, do that after the bucket has been determined to avoid updating the correction factor for a stale bucket. Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com> Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>	2018-10-04 19:27:27 +02:00
Rafael J. Wysocki	eb40a380bf	cpuidle: menu: Do not update last_state_idx in menu_select() It is not necessary to update data->last_state_idx in menu_select() as it only is used in menu_update() which only runs when data->needs_update is set and that is set only when updating data->last_state_idx in menu_reflect(). Accordingly, drop the update of data->last_state_idx from menu_select() and get rid of the (now redundant) "out" label from it. No intentional behavior changes. Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com> Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Daniel Lezcano <daniel.lezcano@linaro.org>	2018-10-04 19:26:38 +02:00
Rafael J. Wysocki	96c3d11df1	cpuidle: menu: Get rid of first_idx from menu_select() Rearrange the code in menu_select() so that the loop over idle states always starts from 0 and get rid of the first_idx variable. While at it, add two empty lines to separate conditional statements from one another. No intentional behavior changes. Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com> Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Daniel Lezcano <daniel.lezcano@linaro.org>	2018-10-04 19:25:53 +02:00
Rafael J. Wysocki	23e8ceb9ce	cpuidle: menu: Compute first_idx when latency_req is known Since menu_select() can only set first_idx to 1 if the exit latency of the second state is not greater than the latency limit, it should first determine that limit. Thus first_idx should be computed after the "interactivity" factor has been taken into account. Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com> Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewedy-by: Daniel Lezcano <daniel.lezcano@linaro.org>	2018-10-04 19:24:14 +02:00
Rafael J. Wysocki	5f26bdceb9	cpuidle: menu: Fix wakeup statistics updates for polling state If the CPU exits the "polling" state due to the time limit in the loop in poll_idle(), this is not a real wakeup and it just means that the "polling" state selection was not adequate. The governor mispredicted short idle duration, but had a more suitable state been selected, the CPU might have spent more time in it. In fact, there is no reason to expect that there would have been a wakeup event earlier than the next timer in that case. Handling such cases as regular wakeups in menu_update() may cause the menu governor to make suboptimal decisions going forward, but ignoring them altogether would not be correct either, because every time menu_select() is invoked, it makes a separate new attempt to predict the idle duration taking distinct time to the closest timer event as input and the outcomes of all those attempts should be recorded. For this reason, make menu_update() always assume that if the "polling" state was exited due to the time limit, the next proper wakeup event for the CPU would be the next timer event (not including the tick). Fixes: `a37b969a61` "cpuidle: poll_state: Add time limit to poll_idle()" Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com> Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Daniel Lezcano <daniel.lezcano@linaro.org>	2018-10-04 10:23:37 +02:00
Rafael J. Wysocki	03dba27804	cpuidle: menu: Replace data->predicted_us with local variable The predicted_us field in struct menu_device is only accessed in menu_select(), so replace it with a local variable in that function. With that, stop using expected_interval instead of predicted_us to store the new predicted idle duration value if it is set to the selected state's target residency which is quite confusing. Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com> Acked-by: Daniel Lezcano <daniel.lezcano@linaro.org>	2018-10-03 12:02:44 +02:00
Fieah Lim	6a5f95b5a4	cpuidle: Remove unnecessary wrapper cpuidle_get_last_residency() cpuidle_get_last_residency() is just a wrapper for retrieving the last_residency member of struct cpuidle_device. It's also weirdly the only wrapper function for accessing cpuidle_* struct member (by my best guess is it could be a leftover from v2.x). Anyhow, since the only two users (the ladder and menu governors) can access dev->last_residency directly, and it's more intuitive to do it that way, let's just get rid of the wrapper. This patch tidies up CPU idle code a bit without functional changes. Signed-off-by: Fieah Lim <kw@fieahl.im> [ rjw: Changelog cleanup ] Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>	2018-09-18 09:24:44 +02:00
Rafael J. Wysocki	757ab15c3f	cpuidle: menu: Retain tick when shallow state is selected The case addressed by commit `5ef499cd57` (cpuidle: menu: Handle stopped tick more aggressively) in the stopped tick case is present when the tick has not been stopped yet too. Namely, if only two CPU idle states, shallow state A with target residency significantly below the tick boundary and deep state B with target residency significantly above it, are available and the predicted idle duration is above the tick boundary, but below the target residency of state B, state A will be selected and the CPU may spend indefinite amount of time in it, which is not quite energy-efficient. However, if the tick has not been stopped yet and the governor is about to select a shallow idle state for the CPU even though the idle duration predicted by it is above the tick boundary, it should be fine to wake up the CPU early, so the tick can be retained then and the governor will have a chance to select a deeper state when it runs next time. [Note that when this really happens, it will make the idle duration predictor believe that the CPU might be idle longer than predicted, which will make it more likely to predict longer idle durations going forward, but that will also cause deeper idle states to be selected going forward, on average, which is what's needed here.] Fixes: `87c9fe6ee4` (cpuidle: menu: Avoid selecting shallow states with stopped tick) Reported-by: Leo Yan <leo.yan@linaro.org> Cc: 4.17+ <stable@vger.kernel.org> # 4.17+: `5ef499cd57` (cpuidle: menu: Handle ...) Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>	2018-08-25 13:16:08 +02:00
Rafael J. Wysocki	5ef499cd57	cpuidle: menu: Handle stopped tick more aggressively Commit `87c9fe6ee4` (cpuidle: menu: Avoid selecting shallow states with stopped tick) missed the case when the target residencies of deep idle states of CPUs are above the tick boundary which may cause the CPU to get stuck in a shallow idle state for a long time. Say there are two CPU idle states available: one shallow, with the target residency much below the tick boundary and one deep, with the target residency significantly above the tick boundary. In that case, if the tick has been stopped already and the expected next timer event is relatively far in the future, the governor will assume the idle duration to be equal to TICK_USEC and it will select the idle state for the CPU accordingly. However, that will cause the shallow state to be selected even though it would have been more energy-efficient to select the deep one. To address this issue, modify the governor to always use the time till the closest timer event instead of the predicted idle duration if the latter is less than the tick period length and the tick has been stopped already. Also make it extend the search for a matching idle state if the tick is stopped to avoid settling on a shallow state if deep states with target residencies above the tick period length are available. In addition, make it always indicate that the tick should be stopped if it has been stopped already for consistency. Fixes: `87c9fe6ee4` (cpuidle: menu: Avoid selecting shallow states with stopped tick) Reported-by: Leo Yan <leo.yan@linaro.org> Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: 4.17+ <stable@vger.kernel.org> # 4.17+ Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>	2018-08-20 13:37:03 +02:00
Rafael J. Wysocki	50f7ccc647	cpuidle: menu: Update stale polling override comment The comment to explain why the menu governor uses idle state 1 instead of idle state 0 as the first one sometimes is stale (among other things it mentions a user setting not present any more), so update it. Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>	2018-08-16 23:05:43 +02:00
Rafael J. Wysocki	f390c5eb28	cpuidle: menu: Fix white space Fix some damaged white space in menu_select(). Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>	2018-08-15 00:08:51 +02:00
Rafael J. Wysocki	0fc784fb09	cpuidle: governors: Consolidate PM QoS handling There is some code duplication related to the PM QoS handling between the existing cpuidle governors, so move that code to a common helper function and call that from the governors. Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>	2018-05-30 23:13:00 +02:00
Rafael J. Wysocki	cf7eeea947	cpuidle: governors: Drop redundant checks related to PM QoS PM_QOS_RESUME_LATENCY_NO_CONSTRAINT is defined as the 32-bit integer maximum, so it is not necessary to test the return value of dev_pm_qos_raw_read_value() against it directly in the menu and ladder cpuidle governors. Drop these redundant checks. Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>	2018-05-30 23:13:00 +02:00
Rafael J. Wysocki	87c9fe6ee4	cpuidle: menu: Avoid selecting shallow states with stopped tick If the scheduler tick has been stopped already and the governor selects a shallow idle state, the CPU can spend a long time in that state if the selection is based on an inaccurate prediction of idle time. That effect turns out to be relevant, so it needs to be mitigated. To that end, modify the menu governor to discard the result of the idle time prediction if the tick is stopped and the predicted idle time is less than the tick period length, unless the tick timer is going to expire soon. Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com> Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>	2018-04-09 11:54:57 +02:00
Rafael J. Wysocki	296bb1e51a	cpuidle: menu: Refine idle state selection for running tick If the tick isn't stopped, the target residency of the state selected by the menu governor may be greater than the actual time to the next tick and that means lost energy. To avoid that, make tick_nohz_get_sleep_length() return the current time to the next event (before stopping the tick) in addition to the estimated one via an extra pointer argument and make menu_select() use that value to refine the state selection when necessary. Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com> Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>	2018-04-09 11:54:56 +02:00
Rafael J. Wysocki	45f1ff59e2	cpuidle: Return nohz hint from cpuidle_select() Add a new pointer argument to cpuidle_select() and to the ->select cpuidle governor callback to allow a boolean value indicating whether or not the tick should be stopped before entering the selected state to be returned from there. Make the ladder governor ignore that pointer (to preserve its current behavior) and make the menu governor return 'false" through it if: (1) the idle exit latency is constrained at 0, or (2) the selected state is a polling one, or (3) the expected idle period duration is within the tick period range. In addition to that, the correction factor computations in the menu governor need to take the possibility that the tick may not be stopped into account to avoid artificially small correction factor values. To that end, add a mechanism to record tick wakeups, as suggested by Peter Zijlstra, and use it to modify the menu_update() behavior when tick wakeup occurs. Namely, if the CPU is woken up by the tick and the return value of tick_nohz_get_sleep_length() is not within the tick boundary, the predicted idle duration is likely too short, so make menu_update() try to compensate for that by updating the governor statistics as though the CPU was idle for a long time. Since the value returned through the new argument pointer of cpuidle_select() is not used by its caller yet, this change by itself is not expected to alter the functionality of the code. Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com> Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>	2018-04-06 09:29:34 +02:00
Ramesh Thomas	c523c68da2	cpuidle: ladder: Add per CPU PM QoS resume latency support Individual CPUs may have special requirements to not enter deep idle states. For example, a CPU running real time applications would not want to enter deep idle states to avoid latency impacts. At the same time other CPUs that do not have such a requirement could allow deep idle states to save power. This was already implemented in the menu governor. Implementing similar changes in the ladder governor which gets selected when CONFIG_NO_HZ and CONFIG_NO_HZ_IDLE are not set. Refer following commits for the menu governor changes. Signed-off-by: Ramesh Thomas <ramesh.thomas@intel.com> Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>	2017-11-08 12:16:01 +01:00
Rafael J. Wysocki	0759e80b84	PM / QoS: Fix device resume latency framework The special value of 0 for device resume latency PM QoS means "no restriction", but there are two problems with that. First, device resume latency PM QoS requests with 0 as the value are always put in front of requests with positive values in the priority lists used internally by the PM QoS framework, causing 0 to be chosen as an effective constraint value. However, that 0 is then interpreted as "no restriction" effectively overriding the other requests with specific restrictions which is incorrect. Second, the users of device resume latency PM QoS have no way to specify that any resume latency at all should be avoided, which is an artificial limitation in general. To address these issues, modify device resume latency PM QoS to use S32_MAX as the "no constraint" value and 0 as the "no latency at all" one and rework its users (the cpuidle menu governor, the genpd QoS governor and the runtime PM framework) to follow these changes. Also add a special "n/a" value to the corresponding user space I/F to allow user space to indicate that it cannot accept any resume latencies at all for the given device. Fixes: `85dc0b8a40` (PM / QoS: Make it possible to expose PM QoS latency constraints) Link: https://bugzilla.kernel.org/show_bug.cgi?id=197323 Reported-by: Reinette Chatre <reinette.chatre@intel.com> Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com> Tested-by: Reinette Chatre <reinette.chatre@intel.com> Tested-by: Geert Uytterhoeven <geert+renesas@glider.be> Tested-by: Tero Kristo <t-kristo@ti.com> Reviewed-by: Ramesh Thomas <ramesh.thomas@intel.com>	2017-11-08 12:14:51 +01:00
Rafael J. Wysocki	dc2251bf98	cpuidle: Eliminate the CPUIDLE_DRIVER_STATE_START symbol On some architectures the first (index 0) idle state is a polling one and it doesn't really save energy, so there is the CPUIDLE_DRIVER_STATE_START symbol allowing some pieces of cpuidle code to avoid using that state. However, this makes the code rather hard to follow. It is better to explicitly avoid the polling state, so add a new cpuidle state flag CPUIDLE_FLAG_POLLING to mark it and make the relevant code check that flag for the first state instead of using the CPUIDLE_DRIVER_STATE_START symbol. In the ACPI processor driver that cannot always rely on the state flags (like before the states table has been set up) define a new internal symbol ACPI_IDLE_STATE_START equivalent to the CPUIDLE_DRIVER_STATE_START one and drop the latter. Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com> Tested-by: Sudeep Holla <sudeep.holla@arm.com> Acked-by: Daniel Lezcano <daniel.lezcano@linaro.org>	2017-08-30 03:05:29 +02:00
Nicholas Piggin	3ed09c9458	cpuidle: menu: allow state 0 to be disabled The menu driver does not allow state0 to be disabled completely. If it is disabled but other enabled states don't meet latency requirements, it is still used. Fix this by starting with the first enabled idle state. Fall back to state 0 if no idle states are enabled (arguably this should be -EINVAL if it is attempted, but this is the minimal fix). Acked-by: Gautham R. Shenoy <ego@linux.vnet.ibm.com> Signed-off-by: Nicholas Piggin <npiggin@gmail.com> Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>	2017-06-29 22:59:17 +02:00
Linus Torvalds	1827adb11a	Merge branch 'WIP.sched-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull sched.h split-up from Ingo Molnar: "The point of these changes is to significantly reduce the <linux/sched.h> header footprint, to speed up the kernel build and to have a cleaner header structure. After these changes the new <linux/sched.h>'s typical preprocessed size goes down from a previous ~0.68 MB (~22K lines) to ~0.45 MB (~15K lines), which is around 40% faster to build on typical configs. Not much changed from the last version (-v2) posted three weeks ago: I eliminated quirks, backmerged fixes plus I rebased it to an upstream SHA1 from yesterday that includes most changes queued up in -next plus all sched.h changes that were pending from Andrew. I've re-tested the series both on x86 and on cross-arch defconfigs, and did a bisectability test at a number of random points. I tried to test as many build configurations as possible, but some build breakage is probably still left - but it should be mostly limited to architectures that have no cross-compiler binaries available on kernel.org, and non-default configurations" * 'WIP.sched-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (146 commits) sched/headers: Clean up <linux/sched.h> sched/headers: Remove #ifdefs from <linux/sched.h> sched/headers: Remove the <linux/topology.h> include from <linux/sched.h> sched/headers, hrtimer: Remove the <linux/wait.h> include from <linux/hrtimer.h> sched/headers, x86/apic: Remove the <linux/pm.h> header inclusion from <asm/apic.h> sched/headers, timers: Remove the <linux/sysctl.h> include from <linux/timer.h> sched/headers: Remove <linux/magic.h> from <linux/sched/task_stack.h> sched/headers: Remove <linux/sched.h> from <linux/sched/init.h> sched/core: Remove unused prefetch_stack() sched/headers: Remove <linux/rculist.h> from <linux/sched.h> sched/headers: Remove the 'init_pid_ns' prototype from <linux/sched.h> sched/headers: Remove <linux/signal.h> from <linux/sched.h> sched/headers: Remove <linux/rwsem.h> from <linux/sched.h> sched/headers: Remove the runqueue_is_locked() prototype sched/headers: Remove <linux/sched.h> from <linux/sched/hotplug.h> sched/headers: Remove <linux/sched.h> from <linux/sched/debug.h> sched/headers: Remove <linux/sched.h> from <linux/sched/nohz.h> sched/headers: Remove <linux/sched.h> from <linux/sched/stat.h> sched/headers: Remove the <linux/gfp.h> include from <linux/sched.h> sched/headers: Remove <linux/rtmutex.h> from <linux/sched.h> ...	2017-03-03 10:16:38 -08:00
Ingo Molnar	03441a3482	sched/headers: Prepare for new header dependencies before moving code to <linux/sched/stat.h> We are going to split <linux/sched/stat.h> out of <linux/sched.h>, which will have to be picked up from other headers and a couple of .c files. Create a trivial placeholder <linux/sched/stat.h> file that just maps to <linux/sched.h> to make this patch obviously correct and bisectable. Include the new header in the files that are going to need it. Acked-by: Linus Torvalds <torvalds@linux-foundation.org> Cc: Mike Galbraith <efault@gmx.de> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: linux-kernel@vger.kernel.org Signed-off-by: Ingo Molnar <mingo@kernel.org>	2017-03-02 08:42:34 +01:00
Ingo Molnar	4f17722c72	sched/headers: Prepare for new header dependencies before moving code to <linux/sched/loadavg.h> We are going to split <linux/sched/loadavg.h> out of <linux/sched.h>, which will have to be picked up from a couple of .c files. Create a trivial placeholder <linux/sched/topology.h> file that just maps to <linux/sched.h> to make this patch obviously correct and bisectable. Include the new header in the files that are going to need it. Acked-by: Linus Torvalds <torvalds@linux-foundation.org> Cc: Mike Galbraith <efault@gmx.de> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: linux-kernel@vger.kernel.org Signed-off-by: Ingo Molnar <mingo@kernel.org>	2017-03-02 08:42:27 +01:00
Rafael J. Wysocki	6dbf5cea05	cpuidle: menu: Avoid taking spinlock for accessing QoS values After commit `9908859aca` (cpuidle/menu: add per CPU PM QoS resume latency consideration) the cpuidle menu governor calls dev_pm_qos_read_value() on CPU devices to read the current resume latency QoS constraint values for them. That function takes a spinlock to prevent the device's power.qos pointer from becoming NULL during the access which is a problem for the RT patchset where spinlocks are converted into mutexes and the idle loop stops working. However, it is not even necessary for the menu governor to take that spinlock, because the power.qos pointer accessed under it cannot be modified during the access anyway. For this reason, introduce a "raw" routine for accessing device QoS resume latency constraints without locking and use it in the menu governor. Fixes: `9908859aca` (cpuidle/menu: add per CPU PM QoS resume latency consideration) Acked-by: Alex Shi <alex.shi@linaro.org> Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>	2017-02-27 15:07:38 +01:00
Alex Shi	9908859aca	cpuidle/menu: add per CPU PM QoS resume latency consideration There may be special requirements on CPU response time, like if a interrupt is pinned to a CPU, that CPU should not go into excessively deep idle states. For this reason, add a mechanism for adding PM QoS resume latency constraints for individual CPUs and modify the menu governor to take them into account. To that end, extend the device PM QoS pm_qos_resume_latency attribute to CPUs, which is possible, because the exit latency for CPUs is effectively equivalent to the resume latency for devices. Signed-off-by: Alex Shi <alex.shi@linaro.org> Acked-by: Rik van Riel <riel@redhat.com> [ rjw : Subject & changelog ] Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>	2017-01-30 11:03:32 +01:00
Alex Shi	8e37e1a2a3	cpuidle/menu: stop seeking deeper idle if current state is deep enough Obsolete commit `71abbbf856` (cpuidle: extend cpuidle and menu governor to handle dynamic states) wanted to introduce dynamic C-states, but that idea was dropped long ago. The nonsense deeper C-state checking remained, though. Since both target_residency and exit_latency are longer for deeper idle state, there's no need to waste CPU time on useless checks. Signed-off-by: Alex Shi <alex.shi@linaro.org> Acked-by: Rik van Riel <riel@redhat.com> [ rjw: Subject & changelog ] Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>	2017-01-30 10:56:00 +01:00
Linus Torvalds	7c0f6ba682	Replace <asm/uaccess.h> with <linux/uaccess.h> globally This was entirely automated, using the script by Al: PATT='^[[:blank:]]#[[:blank:]]include[[:blank:]]*<asm/uaccess.h>' sed -i -e "s!$PATT!#include <linux/uaccess.h>!" \ $(git grep -l "$PATT"\|grep -v ^include/linux/uaccess.h) to do the replacement at the end of the merge window. Requested-by: Al Viro <viro@zeniv.linux.org.uk> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2016-12-24 11:46:01 -08:00
Daniel Lezcano	e5f1b24587	cpuidle: governors: Remove remaining old module code The governor's code use try_module_get() and put_module() to refcount the governor's module. But the governors are not compiled as module. The refcount does not prevent to switch the governor or unload a module as they aren't compiled as modules. The code is pointless, so remove it. Signed-off-by: Daniel Lezcano <daniel.lezcano@linaro.org> Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>	2016-10-21 14:49:51 +02:00
Rafael J. Wysocki	0c313cb207	cpuidle: menu: Fall back to polling if next timer event is near Commit `a9ceb78bc7` (cpuidle,menu: use interactivity_req to disable polling) changed the behavior of the fallback state selection part of menu_select() so it looks at interactivity_req instead of data->next_timer_us when it makes its decision. That effectively caused polling to be used more often as fallback idle which led to significant increases of energy consumption in some cases. Commit `e132b9b3bc` (cpuidle: menu: use high confidence factors only when considering polling) changed that logic again to be more predictable, but that didn't help with the increased energy consumption problem. For this reason, go back to making decisions on which state to fall back to based on data->next_timer_us which is the time we know for sure something will happen rather than a prediction (which may be inaccurate and turns out to be so often enough to be problematic). However, take the target residency of the first proper idle state (C1) into account, so that state is not used as the fallback one if its target residency is greater than data->next_timer_us. Fixes: `a9ceb78bc7` (cpuidle,menu: use interactivity_req to disable polling) Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com> Reported-and-tested-by: Doug Smythies <dsmythies@telus.net>	2016-03-21 15:50:28 +01:00
Rik van Riel	e132b9b3bc	cpuidle: menu: use high confidence factors only when considering polling The menu governor uses five different factors to pick the idle state: - the user configured latency_req - the time until the next timer (next_timer_us) - the typical sleep interval, as measured recently - an estimate of sleep time by dividing next_timer_us by an observed factor - a load corrected version of the above, divided again by load Only the first three items are known with enough confidence that we can use them to consider polling, instead of an actual CPU idle state, because the cost of being wrong about polling can be excessive power use. The latter two are used in the menu governor's main selection loop, and can result in choosing a shallower idle state when the system is expected to be busy again soon. This pushes a busy system in the "performance" direction of the performance<>power tradeoff, when choosing between idle states, but stays more strictly on the "power" state when deciding between polling and C1. Signed-off-by: Rik van Riel <riel@redhat.com> Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>	2016-03-17 02:40:32 +01:00
Rasmus Villemoes	3b99669b75	cpuidle: menu: help gcc generate slightly better code We know that the avg variable actually ends up holding a 32 bit quantity, since it's an average of such numbers. It is only a u64 because it is temporarily used to hold the sum. Making it an actual u32 allows gcc to generate slightly better code, e.g. when computing the square, it can do a 32x32->64 multiply. Signed-off-by: Rasmus Villemoes <linux@rasmusvillemoes.dk> Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>	2016-02-17 00:28:15 +01:00
Rasmus Villemoes	7024b18ca4	cpuidle: menu: avoid expensive square root computation Computing the integer square root is a rather expensive operation, at least compared to doing a 64x64 -> 64 multiply (avg*avg) and, on 64 bit platforms, doing an extra comparison to a constant (variance <= U64_MAX/36). On 64 bit platforms, this does mean that we add a restriction on the range of the variance where we end up using the estimate (since previously the stddev <= ULONG_MAX was a tautology), but on the other hand, we extend the range quite substantially on 32 bit platforms - in both cases, we now allow standard deviations up to 715 seconds, which is for example guaranteed if all observations are less than 1430 seconds. Signed-off-by: Rasmus Villemoes <linux@rasmusvillemoes.dk> Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>	2016-02-17 00:27:16 +01:00
Rafael J. Wysocki	5bb1729cbd	cpuidle: menu: Avoid pointless checks in menu_select() If menu_select() cannot find a suitable state to return, it will return the state index stored in data->last_state_idx. This means that it is pointless to look at the states whose indices are less than or equal to data->last_state_idx in the main loop, so don't do that. Given that those checks are done on every idle state selection, this change can save quite a bit of completely unnecessary overhead. Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com> Acked-by: Ingo Molnar <mingo@kernel.org> Tested-by: Sudeep Holla <sudeep.holla@arm.com>	2016-01-19 15:28:23 +01:00
Jean Delvare	66a5f6b639	cpuidle: Default to ladder governor on ticking systems The menu governor is currently the default on all systems. However the documentation claims that the ladder governor is preferred on ticking systems. So bump the rating of the ladder governor when NO_HZ is disabled, or when booting with nohz=off. This fixes the first half of kernel BZ #65531. Link: https://bugzilla.kernel.org/show_bug.cgi?id=65531 Signed-off-by: Jean Delvare <jdelvare@suse.de> Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>	2016-01-15 22:36:19 +01:00
Rafael J. Wysocki	9c4b2867ed	cpuidle: menu: Fix menu_select() for CPUIDLE_DRIVER_STATE_START == 0 Commit `a9ceb78bc7` (cpuidle,menu: use interactivity_req to disable polling) exposed a bug in menu_select() causing it to return -1 on systems with CPUIDLE_DRIVER_STATE_START equal to zero, although it should have returned 0. As a result, idle states are not entered by CPUs on those systems. Namely, on the systems in question data->last_state_idx is initially equal to -1 and the above commit modified the condition that would have caused it to be changed to 0 to be less likely to trigger which exposed the problem. However, setting data->last_state_idx initially to -1 doesn't make sense at all and on the affected systems it should always be set to CPUIDLE_DRIVER_STATE_START (ie. 0) unconditionally, so make that happen. Fixes: `a9ceb78bc7` (cpuidle,menu: use interactivity_req to disable polling) Reported-and-tested-by: Sudeep Holla <sudeep.holla@arm.com> Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>	2016-01-14 23:24:22 +01:00
Rik van Riel	efddfd90fb	cpuidle,menu: smooth out measured_us calculation The cpuidle state tables contain the maximum exit latency for each cpuidle state. On x86, that is the exit latency for when the entire package goes into that same idle state. However, a lot of the time we only go into the core idle state, not the package idle state. This means we see a much smaller exit latency. We have no way to detect whether we went into the core or package idle state while idle, and that is ok. However, the current menu_update logic does have the potential to trip up the repeating pattern detection in get_typical_interval. If the system is experiencing an exit latency near the idle state's exit latency, some of the samples will have exit_us subtracted, while others will not. This turns a repeating pattern into mush, potentially breaking get_typical_interval. Furthermore, for smaller sleep intervals, we know the chance that all the cores in the package went to the same idle state are fairly small. Dividing the measured_us by two, instead of subtracting the full exit latency when hitting a small measured_us, will reduce the error. Signed-off-by: Rik van Riel <riel@redhat.com> Acked-by: Arjan van de Ven <arjan@linux.intel.com> Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>	2015-11-17 02:24:25 +01:00
Rik van Riel	a9ceb78bc7	cpuidle,menu: use interactivity_req to disable polling The menu governor carefully figures out how much time we typically sleep for an estimated sleep interval, or whether there is a repeating pattern going on, and corrects that estimate for the CPU load. Then it proceeds to ignore that information when determining whether or not to consider polling. This is not a big deal on most x86 CPUs, which have very low C1 latencies, and the patch should not have any effect on those CPUs. However, certain CPUs (eg. Atom) have much higher C1 latencies, and it would be good to not waste performance and power on those CPUs if we are expecting a very low wakeup latency. Disable polling based on the estimated interactivity requirement, not on the time to the next timer interrupt. Signed-off-by: Rik van Riel <riel@redhat.com> Acked-by: Arjan van de Ven <arjan@linux.intel.com> Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>	2015-11-17 02:24:25 +01:00
Rik van Riel	7884084f3b	cpuidle,x86: increase forced cut-off for polling to 20us The cpuidle menu governor has a forced cut-off for polling at 5us, in order to deal with firmware that gives the OS bad information on cpuidle states, leading to the system spending way too much time in polling. However, at least one x86 CPU family (Atom) has chips that have a 20us break-even point for C1. Forcing the polling cut-off to less than that wastes performance and power. Increase the polling cut-off to 20us. Systems with a lower C1 latency will be found in the states table by the menu governor, which will pick those states as appropriate. Signed-off-by: Rik van Riel <riel@redhat.com> Acked-by: Arjan van de Ven <arjan@linux.intel.com> Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>	2015-11-17 02:24:24 +01:00
Rafael J. Wysocki	a802ea9645	cpuidle: Check the sign of index in cpuidle_reflect() Avoid calling the governor's ->reflect method if the state index passed to cpuidle_reflect() is negative. This allows the analogous check to be dropped from menu_reflect(), so do that too, and ensures that arbitrary error codes can be passed to cpuidle_reflect() as the index with no adverse consequences. Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com> Reviewed-by: Daniel Lezcano <daniel.lezcano@linaro.org> Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>	2015-05-04 22:53:28 +02:00
Javi Merino	ee3c86f356	cpuidle: menu: use DIV_ROUND_CLOSEST_ULL() Now that the kernel provides DIV_ROUND_CLOSEST_ULL(), drop the internal implementation and use the kernel one. Signed-off-by: Javi Merino <javi.merino@arm.com> Acked-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com> Cc: Mel Gorman <mgorman@suse.de> Cc: Stephen Hemminger <shemminger@linux-foundation.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2015-04-17 09:03:55 -04:00
Len Brown	b73026b9c9	cpuidle: ladder: Better idle duration measurement without using CPUIDLE_FLAG_TIME_INVALID When the ladder governor sees the CPUIDLE_FLAG_TIME_INVALID flag, it unconditionally causes a state promotion by setting last_residency to a number higher than the state's promotion_time: last_residency = last_state->threshold.promotion_time + 1 It does this for fear that cpuidle_get_last_residency() will be in-accurate, because cpuidle_enter_state() invoked a state with CPUIDLE_FLAG_TIME_INVALID. But the only state with CPUIDLE_FLAG_TIME_INVALID is acpi_safe_halt(), which may return well after its actual idle duration because it enables interrupts, so cpuidle_enter_state() also measures interrupt service time. So what? In ladder, a huge invalid last_residency has exactly the same effect as the current code -- it unconditionally causes a state promotion. In the case where the idle residency plus measured interrupt handling time is less than the state's demotion_time -- we should use that timestamp to give ladder a chance to demote, rather than unconditionally promoting. This can be done by simply ignoring the CPUIDLE_FLAG_TIME_INVALID, and using the "invalid" time, as it is either equal to what we are doing today, or better. Signed-off-by: Len Brown <len.brown@intel.com> Acked-by: Daniel Lezcano <daniel.lezcano@linaro.org> Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>	2014-12-17 02:26:28 +01:00
Len Brown	4108b3d962	cpuidle: menu: Better idle duration measurement without using CPUIDLE_FLAG_TIME_INVALID When menu sees CPUIDLE_FLAG_TIME_INVALID, it ignores its timestamps, and assumes that idle lasted as long as the time till next predicted timer expiration. But if an interrupt was seen and serviced before that duration, it would actually be more accurate to use the measured time rather than rounding up to the next predicted timer expiration. And if an interrupt is seen and serviced such that the mesured time exceeds the time till next predicted timer expiration, then truncating to that expiration is the right thing to do -- since we can never stay idle past that timer expiration. So the code can do a better job without checking for CPUIDLE_FLAG_TIME_INVALID. Signed-off-by: Len Brown <len.brown@intel.com> Acked-by: Daniel Lezcano <daniel.lezcano@linaro.org> Reviewed-by: Tuukka Tikkanen <tuukka.tikkanen@linaro.org> Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>	2014-12-17 02:26:28 +01:00
Daniel Lezcano	b82b6cca48	cpuidle: Invert CPUIDLE_FLAG_TIME_VALID logic The only place where the time is invalid is when the ACPI_CSTATE_FFH entry method is not set. Otherwise for all the drivers, the time can be correctly measured. Instead of duplicating the CPUIDLE_FLAG_TIME_VALID flag in all the drivers for all the states, just invert the logic by replacing it by the flag CPUIDLE_FLAG_TIME_INVALID, hence we can set this flag only for the acpi idle driver, remove the former flag from all the drivers and invert the logic with this flag in the different governor. Signed-off-by: Daniel Lezcano <daniel.lezcano@linaro.org> Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>	2014-11-12 21:17:27 +01:00

1 2 3

116 Commits