Merge branch 'android-4.14-eas-integration' into experimental-android-4.14 Refactored EAS patches on top of 4.14 base Change-Id: I13a68ddad54ad49c7fec77ad6dec1ec5325aae27
diff --git a/Documentation/devicetree/bindings/scheduler/sched-energy-costs.txt b/Documentation/devicetree/bindings/scheduler/sched-energy-costs.txt new file mode 100644 index 0000000..2ceb202 --- /dev/null +++ b/Documentation/devicetree/bindings/scheduler/sched-energy-costs.txt
@@ -0,0 +1,378 @@ +=========================================================== +Energy cost bindings for Energy Aware Scheduling +=========================================================== + +=========================================================== +1 - Introduction +=========================================================== + +This note specifies bindings required for energy-aware scheduling +(EAS)[1]. Historically, the scheduler's primary objective has been +performance. EAS aims to provide an alternative objective - energy +efficiency. EAS relies on a simple platform energy cost model to +guide scheduling decisions. The model only considers the CPU +subsystem. + +This note is aligned with the definition of the layout of physical +CPUs in the system as described in the ARM topology binding +description [2]. The concept is applicable to any system so long as +the cost model data is provided for those processing elements in +that system's topology that EAS is required to service. + +Processing elements refer to hardware threads, CPUs and clusters of +related CPUs in increasing order of hierarchy. + +EAS requires two key cost metrics - busy costs and idle costs. Busy +costs comprise of a list of compute capacities for the processing +element in question and the corresponding power consumption at that +capacity. Idle costs comprise of a list of power consumption values +for each idle state [C-state] that the processing element supports. +For a detailed description of these metrics, their derivation and +their use see [3]. + +These cost metrics are required for processing elements in all +scheduling domain levels that EAS is required to service. + +=========================================================== +2 - energy-costs node +=========================================================== + +Energy costs for the processing elements in scheduling domains that +EAS is required to service are defined in the energy-costs node +which acts as a container for the actual per processing element cost +nodes. A single energy-costs node is required for a given system. + +- energy-costs node + + Usage: Required + + Description: The energy-costs node is a container node and + it's sub-nodes describe costs for each processing element at + all scheduling domain levels that EAS is required to + service. + + Node name must be "energy-costs". + + The energy-costs node's parent node must be the cpus node. + + The energy-costs node's child nodes can be: + + - one or more cost nodes. + + Any other configuration is considered invalid. + +The energy-costs node can only contain a single type of child node +whose bindings are described in paragraph 4. + +=========================================================== +3 - energy-costs node child nodes naming convention +=========================================================== + +energy-costs child nodes must follow a naming convention where the +node name must be "thread-costN", "core-costN", "cluster-costN" +depending on whether the costs in the node are for a thread, core or +cluster. N (where N = {0, 1, ...}) is the node number and has no +bearing to the OS' logical thread, core or cluster index. + +=========================================================== +4 - cost node bindings +=========================================================== + +Bindings for cost nodes are defined as follows: + +- system-cost node + + Description: Optional. Must be declared within an energy-costs + node. A system should contain no more than one system-cost node. + + Systems with no modelled system cost should not provide this + node. + + The system-cost node name must be "system-costN" as + described in 3 above. + + A system-cost node must be a leaf node with no children. + + Properties for system-cost nodes are described in paragraph + 5 below. + + Any other configuration is considered invalid. + +- cluster-cost node + + Description: must be declared within an energy-costs node. A + system can contain multiple clusters and each cluster + serviced by EAS must have a corresponding cluster-costs + node. + + The cluster-cost node name must be "cluster-costN" as + described in 3 above. + + A cluster-cost node must be a leaf node with no children. + + Properties for cluster-cost nodes are described in paragraph + 5 below. + + Any other configuration is considered invalid. + +- core-cost node + + Description: must be declared within an energy-costs node. A + system can contain multiple cores and each core serviced by + EAS must have a corresponding core-cost node. + + The core-cost node name must be "core-costN" as described in + 3 above. + + A core-cost node must be a leaf node with no children. + + Properties for core-cost nodes are described in paragraph + 5 below. + + Any other configuration is considered invalid. + +- thread-cost node + + Description: must be declared within an energy-costs node. A + system can contain cores with multiple hardware threads and + each thread serviced by EAS must have a corresponding + thread-cost node. + + The core-cost node name must be "core-costN" as described in + 3 above. + + A core-cost node must be a leaf node with no children. + + Properties for thread-cost nodes are described in paragraph + 5 below. + + Any other configuration is considered invalid. + +=========================================================== +5 - Cost node properties +========================================================== + +All cost node types must have only the following properties: + +- busy-cost-data + + Usage: required + Value type: An array of 2-item tuples. Each item is of type + u32. + Definition: The first item in the tuple is the capacity + value as described in [3]. The second item in the tuple is + the energy cost value as described in [3]. + +- idle-cost-data + + Usage: required + Value type: An array of 1-item tuples. The item is of type + u32. + Definition: The item in the tuple is the energy cost value + as described in [3]. + +=========================================================== +4 - Extensions to the cpu node +=========================================================== + +The cpu node is extended with a property that establishes the +connection between the processing element represented by the cpu +node and the cost-nodes associated with this processing element. + +The connection is expressed in line with the topological hierarchy +that this processing element belongs to starting with the level in +the hierarchy that this processing element itself belongs to through +to the highest level that EAS is required to service. The +connection cannot be sparse and must be contiguous from the +processing element's level through to the highest desired level. The +highest desired level must be the same for all processing elements. + +Example: Given that a cpu node may represent a thread that is a part +of a core, this property may contain multiple elements which +associate the thread with cost nodes describing the costs for the +thread itself, the core the thread belongs to, the cluster the core +belongs to and so on. The elements must be ordered from the lowest +level nodes to the highest desired level that EAS must service. The +highest desired level must be the same for all cpu nodes. The +elements must not be sparse: there must be elements for the current +thread, the next level of hierarchy (core) and so on without any +'holes'. + +Example: Given that a cpu node may represent a core that is a part +of a cluster of related cpus this property may contain multiple +elements which associate the core with cost nodes describing the +costs for the core itself, the cluster the core belongs to and so +on. The elements must be ordered from the lowest level nodes to the +highest desired level that EAS must service. The highest desired +level must be the same for all cpu nodes. The elements must not be +sparse: there must be elements for the current thread, the next +level of hierarchy (core) and so on without any 'holes'. + +If the system comprises of hierarchical clusters of clusters, this +property will contain multiple associations with the relevant number +of cluster elements in hierarchical order. + +Property added to the cpu node: + +- sched-energy-costs + + Usage: required + Value type: List of phandles + Definition: a list of phandles to specific cost nodes in the + energy-costs parent node that correspond to the processing + element represented by this cpu node in hierarchical order + of topology. + + The order of phandles in the list is significant. The first + phandle is to the current processing element's own cost + node. Subsequent phandles are to higher hierarchical level + cost nodes up until the maximum level that EAS is to + service. + + All cpu nodes must have the same highest level cost node. + + The phandle list must not be sparsely populated with handles + to non-contiguous hierarchical levels. See commentary above + for clarity. + + Any other configuration is invalid. + +=========================================================== +5 - Example dts +=========================================================== + +Example 1 (ARM 64-bit, 6-cpu system, two clusters of cpus, one +cluster of 2 Cortex-A57 cpus, one cluster of 4 Cortex-A53 cpus): + +cpus { + #address-cells = <2>; + #size-cells = <0>; + . + . + . + A57_0: cpu@0 { + compatible = "arm,cortex-a57","arm,armv8"; + reg = <0x0 0x0>; + device_type = "cpu"; + enable-method = "psci"; + next-level-cache = <&A57_L2>; + clocks = <&scpi_dvfs 0>; + cpu-idle-states = <&CPU_SLEEP_0 &CLUSTER_SLEEP_0>; + sched-energy-costs = <&CPU_COST_0 &CLUSTER_COST_0>; + }; + + A57_1: cpu@1 { + compatible = "arm,cortex-a57","arm,armv8"; + reg = <0x0 0x1>; + device_type = "cpu"; + enable-method = "psci"; + next-level-cache = <&A57_L2>; + clocks = <&scpi_dvfs 0>; + cpu-idle-states = <&CPU_SLEEP_0 &CLUSTER_SLEEP_0>; + sched-energy-costs = <&CPU_COST_0 &CLUSTER_COST_0>; + }; + + A53_0: cpu@100 { + compatible = "arm,cortex-a53","arm,armv8"; + reg = <0x0 0x100>; + device_type = "cpu"; + enable-method = "psci"; + next-level-cache = <&A53_L2>; + clocks = <&scpi_dvfs 1>; + cpu-idle-states = <&CPU_SLEEP_0 &CLUSTER_SLEEP_0>; + sched-energy-costs = <&CPU_COST_1 &CLUSTER_COST_1>; + }; + + A53_1: cpu@101 { + compatible = "arm,cortex-a53","arm,armv8"; + reg = <0x0 0x101>; + device_type = "cpu"; + enable-method = "psci"; + next-level-cache = <&A53_L2>; + clocks = <&scpi_dvfs 1>; + cpu-idle-states = <&CPU_SLEEP_0 &CLUSTER_SLEEP_0>; + sched-energy-costs = <&CPU_COST_1 &CLUSTER_COST_1>; + }; + + A53_2: cpu@102 { + compatible = "arm,cortex-a53","arm,armv8"; + reg = <0x0 0x102>; + device_type = "cpu"; + enable-method = "psci"; + next-level-cache = <&A53_L2>; + clocks = <&scpi_dvfs 1>; + cpu-idle-states = <&CPU_SLEEP_0 &CLUSTER_SLEEP_0>; + sched-energy-costs = <&CPU_COST_1 &CLUSTER_COST_1>; + }; + + A53_3: cpu@103 { + compatible = "arm,cortex-a53","arm,armv8"; + reg = <0x0 0x103>; + device_type = "cpu"; + enable-method = "psci"; + next-level-cache = <&A53_L2>; + clocks = <&scpi_dvfs 1>; + cpu-idle-states = <&CPU_SLEEP_0 &CLUSTER_SLEEP_0>; + sched-energy-costs = <&CPU_COST_1 &CLUSTER_COST_1>; + }; + + energy-costs { + CPU_COST_0: core-cost0 { + busy-cost-data = < + 417 168 + 579 251 + 744 359 + 883 479 + 1024 616 + >; + idle-cost-data = < + 15 + 0 + >; + }; + CPU_COST_1: core-cost1 { + busy-cost-data = < + 235 33 + 302 46 + 368 61 + 406 76 + 447 93 + >; + idle-cost-data = < + 6 + 0 + >; + }; + CLUSTER_COST_0: cluster-cost0 { + busy-cost-data = < + 417 24 + 579 32 + 744 43 + 883 49 + 1024 64 + >; + idle-cost-data = < + 65 + 24 + >; + }; + CLUSTER_COST_1: cluster-cost1 { + busy-cost-data = < + 235 26 + 303 30 + 368 39 + 406 47 + 447 57 + >; + idle-cost-data = < + 56 + 17 + >; + }; + }; +}; + +=============================================================================== +[1] https://lkml.org/lkml/2015/5/12/728 +[2] Documentation/devicetree/bindings/topology.txt +[3] Documentation/scheduler/sched-energy.txt
diff --git a/Documentation/scheduler/sched-energy.txt b/Documentation/scheduler/sched-energy.txt new file mode 100644 index 0000000..dab2f90 --- /dev/null +++ b/Documentation/scheduler/sched-energy.txt
@@ -0,0 +1,362 @@ +Energy cost model for energy-aware scheduling (EXPERIMENTAL) + +Introduction +============= + +The basic energy model uses platform energy data stored in sched_group_energy +data structures attached to the sched_groups in the sched_domain hierarchy. The +energy cost model offers two functions that can be used to guide scheduling +decisions: + +1. static unsigned int sched_group_energy(struct energy_env *eenv) +2. static int energy_diff(struct energy_env *eenv) + +sched_group_energy() estimates the energy consumed by all cpus in a specific +sched_group including any shared resources owned exclusively by this group of +cpus. Resources shared with other cpus are excluded (e.g. later level caches). + +energy_diff() estimates the total energy impact of a utilization change. That +is, adding, removing, or migrating utilization (tasks). + +Both functions use a struct energy_env to specify the scenario to be evaluated: + + struct energy_env { + struct sched_group *sg_top; + struct sched_group *sg_cap; + int cap_idx; + int util_delta; + int src_cpu; + int dst_cpu; + int energy; + }; + +sg_top: sched_group to be evaluated. Not used by energy_diff(). + +sg_cap: sched_group covering the cpus in the same frequency domain. Set by +sched_group_energy(). + +cap_idx: Capacity state to be used for energy calculations. Set by +find_new_capacity(). + +util_delta: Amount of utilization to be added, removed, or migrated. + +src_cpu: Source cpu from where 'util_delta' utilization is removed. Should be +-1 if no source (e.g. task wake-up). + +dst_cpu: Destination cpu where 'util_delta' utilization is added. Should be -1 +if utilization is removed (e.g. terminating tasks). + +energy: Result of sched_group_energy(). + +The metric used to represent utilization is the actual per-entity running time +averaged over time using a geometric series. Very similar to the existing +per-entity load-tracking, but _not_ scaled by task priority and capped by the +capacity of the cpu. The latter property does mean that utilization may +underestimate the compute requirements for task on fully/over utilized cpus. +The greatest potential for energy savings without affecting performance too much +is scenarios where the system isn't fully utilized. If the system is deemed +fully utilized load-balancing should be done with task load (includes task +priority) instead in the interest of fairness and performance. + + +Background and Terminology +=========================== + +To make it clear from the start: + +energy = [joule] (resource like a battery on powered devices) +power = energy/time = [joule/second] = [watt] + +The goal of energy-aware scheduling is to minimize energy, while still getting +the job done. That is, we want to maximize: + + performance [inst/s] + -------------------- + power [W] + +which is equivalent to minimizing: + + energy [J] + ----------- + instruction + +while still getting 'good' performance. It is essentially an alternative +optimization objective to the current performance-only objective for the +scheduler. This alternative considers two objectives: energy-efficiency and +performance. Hence, there needs to be a user controllable knob to switch the +objective. Since it is early days, this is currently a sched_feature +(ENERGY_AWARE). + +The idea behind introducing an energy cost model is to allow the scheduler to +evaluate the implications of its decisions rather than applying energy-saving +techniques blindly that may only have positive effects on some platforms. At +the same time, the energy cost model must be as simple as possible to minimize +the scheduler latency impact. + +Platform topology +------------------ + +The system topology (cpus, caches, and NUMA information, not peripherals) is +represented in the scheduler by the sched_domain hierarchy which has +sched_groups attached at each level that covers one or more cpus (see +sched-domains.txt for more details). To add energy awareness to the scheduler +we need to consider power and frequency domains. + +Power domain: + +A power domain is a part of the system that can be powered on/off +independently. Power domains are typically organized in a hierarchy where you +may be able to power down just a cpu or a group of cpus along with any +associated resources (e.g. shared caches). Powering up a cpu means that all +power domains it is a part of in the hierarchy must be powered up. Hence, it is +more expensive to power up the first cpu that belongs to a higher level power +domain than powering up additional cpus in the same high level domain. Two +level power domain hierarchy example: + + Power source + +-------------------------------+----... +per group PD G G + | +----------+ | + +--------+-------| Shared | (other groups) +per-cpu PD G G | resource | + | | +----------+ + +-------+ +-------+ + | CPU 0 | | CPU 1 | + +-------+ +-------+ + +Frequency domain: + +Frequency domains (P-states) typically cover the same group of cpus as one of +the power domain levels. That is, there might be several smaller power domains +sharing the same frequency (P-state) or there might be a power domain spanning +multiple frequency domains. + +From a scheduling point of view there is no need to know the actual frequencies +[Hz]. All the scheduler cares about is the compute capacity available at the +current state (P-state) the cpu is in and any other available states. For that +reason, and to also factor in any cpu micro-architecture differences, compute +capacity scaling states are called 'capacity states' in this document. For SMP +systems this is equivalent to P-states. For mixed micro-architecture systems +(like ARM big.LITTLE) it is P-states scaled according to the micro-architecture +performance relative to the other cpus in the system. + +Energy modelling: +------------------ + +Due to the hierarchical nature of the power domains, the most obvious way to +model energy costs is therefore to associate power and energy costs with +domains (groups of cpus). Energy costs of shared resources are associated with +the group of cpus that share the resources, only the cost of powering the +cpu itself and any private resources (e.g. private L1 caches) is associated +with the per-cpu groups (lowest level). + +For example, for an SMP system with per-cpu power domains and a cluster level +(group of cpus) power domain we get the overall energy costs to be: + + energy = energy_cluster + n * energy_cpu + +where 'n' is the number of cpus powered up and energy_cluster is the cost paid +as soon as any cpu in the cluster is powered up. + +The power and frequency domains can naturally be mapped onto the existing +sched_domain hierarchy and sched_groups by adding the necessary data to the +existing data structures. + +The energy model considers energy consumption from two contributors (shown in +the illustration below): + +1. Busy energy: Energy consumed while a cpu and the higher level groups that it +belongs to are busy running tasks. Busy energy is associated with the state of +the cpu, not an event. The time the cpu spends in this state varies. Thus, the +most obvious platform parameter for this contribution is busy power +(energy/time). + +2. Idle energy: Energy consumed while a cpu and higher level groups that it +belongs to are idle (in a C-state). Like busy energy, idle energy is associated +with the state of the cpu. Thus, the platform parameter for this contribution +is idle power (energy/time). + +Energy consumed during transitions from an idle-state (C-state) to a busy state +(P-state) or going the other way is ignored by the model to simplify the energy +model calculations. + + + Power + ^ + | busy->idle idle->busy + | transition transition + | + | _ __ + | / \ / \__________________ + |______________/ \ / + | \ / + | Busy \ Idle / Busy + | low P-state \____________/ high P-state + | + +------------------------------------------------------------> time + +Busy |--------------| |-----------------| + +Wakeup |------| |------| + +Idle |------------| + + +The basic algorithm +==================== + +The basic idea is to determine the total energy impact when utilization is +added or removed by estimating the impact at each level in the sched_domain +hierarchy starting from the bottom (sched_group contains just a single cpu). +The energy cost comes from busy time (sched_group is awake because one or more +cpus are busy) and idle time (in an idle-state). Energy model numbers account +for energy costs associated with all cpus in the sched_group as a group. + + for_each_domain(cpu, sd) { + sg = sched_group_of(cpu) + energy_before = curr_util(sg) * busy_power(sg) + + (1-curr_util(sg)) * idle_power(sg) + energy_after = new_util(sg) * busy_power(sg) + + (1-new_util(sg)) * idle_power(sg) + energy_diff += energy_before - energy_after + + } + + return energy_diff + +{curr, new}_util: The cpu utilization at the lowest level and the overall +non-idle time for the entire group for higher levels. Utilization is in the +range 0.0 to 1.0 in the pseudo-code. + +busy_power: The power consumption of the sched_group. + +idle_power: The power consumption of the sched_group when idle. + +Note: It is a fundamental assumption that the utilization is (roughly) scale +invariant. Task utilization tracking factors in any frequency scaling and +performance scaling differences due to difference cpu microarchitectures such +that task utilization can be used across the entire system. + + +Platform energy data +===================== + +struct sched_group_energy can be attached to sched_groups in the sched_domain +hierarchy and has the following members: + +cap_states: + List of struct capacity_state representing the supported capacity states + (P-states). struct capacity_state has two members: cap and power, which + represents the compute capacity and the busy_power of the state. The + list must be ordered by capacity low->high. + +nr_cap_states: + Number of capacity states in cap_states list. + +idle_states: + List of struct idle_state containing idle_state power cost for each + idle-state supported by the system orderd by shallowest state first. + All states must be included at all level in the hierarchy, i.e. a + sched_group spanning just a single cpu must also include coupled + idle-states (cluster states). In addition to the cpuidle idle-states, + the list must also contain an entry for the idling using the arch + default idle (arch_idle_cpu()). Despite this state may not be a true + hardware idle-state it is considered the shallowest idle-state in the + energy model and must be the first entry. cpus may enter this state + (possibly 'active idling') if cpuidle decides not enter a cpuidle + idle-state. Default idle may not be used when cpuidle is enabled. + In this case, it should just be a copy of the first cpuidle idle-state. + +nr_idle_states: + Number of idle states in idle_states list. + +There are no unit requirements for the energy cost data. Data can be normalized +with any reference, however, the normalization must be consistent across all +energy cost data. That is, one bogo-joule/watt must be the same quantity for +data, but we don't care what it is. + +A recipe for platform characterization +======================================= + +Obtaining the actual model data for a particular platform requires some way of +measuring power/energy. There isn't a tool to help with this (yet). This +section provides a recipe for use as reference. It covers the steps used to +characterize the ARM TC2 development platform. This sort of measurements is +expected to be done anyway when tuning cpuidle and cpufreq for a given +platform. + +The energy model needs two types of data (struct sched_group_energy holds +these) for each sched_group where energy costs should be taken into account: + +1. Capacity state information + +A list containing the compute capacity and power consumption when fully +utilized attributed to the group as a whole for each available capacity state. +At the lowest level (group contains just a single cpu) this is the power of the +cpu alone without including power consumed by resources shared with other cpus. +It basically needs to fit the basic modelling approach described in "Background +and Terminology" section: + + energy_system = energy_shared + n * energy_cpu + +for a system containing 'n' busy cpus. Only 'energy_cpu' should be included at +the lowest level. 'energy_shared' is included at the next level which +represents the group of cpus among which the resources are shared. + +This model is, of course, a simplification of reality. Thus, power/energy +attributions might not always exactly represent how the hardware is designed. +Also, busy power is likely to depend on the workload. It is therefore +recommended to use a representative mix of workloads when characterizing the +capacity states. + +If the group has no capacity scaling support, the list will contain a single +state where power is the busy power attributed to the group. The capacity +should be set to a default value (1024). + +When frequency domains include multiple power domains, the group representing +the frequency domain and all child groups share capacity states. This must be +indicated by setting the SD_SHARE_CAP_STATES sched_domain flag. All groups at +all levels that share the capacity state must have the list of capacity states +with the power set to the contribution of the individual group. + +2. Idle power information + +Stored in the idle_states list. The power number is the group idle power +consumption in each idle state as well when the group is idle but has not +entered an idle-state ('active idle' as mentioned earlier). Due to the way the +energy model is defined, the idle power of the deepest group idle state can +alternatively be accounted for in the parent group busy power. In that case the +group idle state power values are offset such that the idle power of the +deepest state is zero. It is less intuitive, but it is easier to measure as +idle power consumed by the group and the busy/idle power of the parent group +cannot be distinguished without per group measurement points. + +Measuring capacity states and idle power: + +The capacity states' capacity and power can be estimated by running a benchmark +workload at each available capacity state. By restricting the benchmark to run +on subsets of cpus it is possible to extrapolate the power consumption of +shared resources. + +ARM TC2 has two clusters of two and three cpus respectively. Each cluster has a +shared L2 cache. TC2 has on-chip energy counters per cluster. Running a +benchmark workload on just one cpu in a cluster means that power is consumed in +the cluster (higher level group) and a single cpu (lowest level group). Adding +another benchmark task to another cpu increases the power consumption by the +amount consumed by the additional cpu. Hence, it is possible to extrapolate the +cluster busy power. + +For platforms that don't have energy counters or equivalent instrumentation +built-in, it may be possible to use an external DAQ to acquire similar data. + +If the benchmark includes some performance score (for example sysbench cpu +benchmark), this can be used to record the compute capacity. + +Measuring idle power requires insight into the idle state implementation on the +particular platform. Specifically, if the platform has coupled idle-states (or +package states). To measure non-coupled per-cpu idle-states it is necessary to +keep one cpu busy to keep any shared resources alive to isolate the idle power +of the cpu from idle/busy power of the shared resources. The cpu can be tricked +into different per-cpu idle states by disabling the other states. Based on +various combinations of measurements with specific cpus busy and disabling +idle-states it is possible to extrapolate the idle-state power.
diff --git a/Documentation/scheduler/sched-tune.txt b/Documentation/scheduler/sched-tune.txt new file mode 100644 index 0000000..5df0ea3 --- /dev/null +++ b/Documentation/scheduler/sched-tune.txt
@@ -0,0 +1,413 @@ + Central, scheduler-driven, power-performance control + (EXPERIMENTAL) + +Abstract +======== + +The topic of a single simple power-performance tunable, that is wholly +scheduler centric, and has well defined and predictable properties has come up +on several occasions in the past [1,2]. With techniques such as a scheduler +driven DVFS [3], we now have a good framework for implementing such a tunable. +This document describes the overall ideas behind its design and implementation. + + +Table of Contents +================= + +1. Motivation +2. Introduction +3. Signal Boosting Strategy +4. OPP selection using boosted CPU utilization +5. Per task group boosting +6. Per-task wakeup-placement-strategy Selection +7. Question and Answers + - What about "auto" mode? + - What about boosting on a congested system? + - How CPUs are boosted when we have tasks with multiple boost values? +8. References + + +1. Motivation +============= + +Sched-DVFS [3] was a new event-driven cpufreq governor which allows the +scheduler to select the optimal DVFS operating point (OPP) for running a task +allocated to a CPU. Later, the cpufreq maintainers introduced a similar +governor, schedutil. The introduction of schedutil also enables running +workloads at the most energy efficient OPPs. + +However, sometimes it may be desired to intentionally boost the performance of +a workload even if that could imply a reasonable increase in energy +consumption. For example, in order to reduce the response time of a task, we +may want to run the task at a higher OPP than the one that is actually required +by it's CPU bandwidth demand. + +This last requirement is especially important if we consider that one of the +main goals of the utilization-driven governor component is to replace all +currently available CPUFreq policies. Since sched-DVFS and schedutil are event +based, as opposed to the sampling driven governors we currently have, they are +already more responsive at selecting the optimal OPP to run tasks allocated to +a CPU. However, just tracking the actual task load demand may not be enough +from a performance standpoint. For example, it is not possible to get +behaviors similar to those provided by the "performance" and "interactive" +CPUFreq governors. + +This document describes an implementation of a tunable, stacked on top of the +utilization-driven governors which extends their functionality to support task +performance boosting. + +By "performance boosting" we mean the reduction of the time required to +complete a task activation, i.e. the time elapsed from a task wakeup to its +next deactivation (e.g. because it goes back to sleep or it terminates). For +example, if we consider a simple periodic task which executes the same workload +for 5[s] every 20[s] while running at a certain OPP, a boosted execution of +that task must complete each of its activations in less than 5[s]. + +A previous attempt [5] to introduce such a boosting feature has not been +successful mainly because of the complexity of the proposed solution. Previous +versions of the approach described in this document exposed a single simple +interface to user-space. This single tunable knob allowed the tuning of +system wide scheduler behaviours ranging from energy efficiency at one end +through to incremental performance boosting at the other end. This first +tunable affects all tasks. However, that is not useful for Android products +so in this version only a more advanced extension of the concept is provided +which uses CGroups to boost the performance of only selected tasks while using +the energy efficient default for all others. + +The rest of this document introduces in more details the proposed solution +which has been named SchedTune. + + +2. Introduction +=============== + +SchedTune exposes a simple user-space interface provided through a new +CGroup controller 'stune' which provides two power-performance tunables +per group: + + /<stune cgroup mount point>/schedtune.prefer_idle + /<stune cgroup mount point>/schedtune.boost + +The CGroup implementation permits arbitrary user-space defined task +classification to tune the scheduler for different goals depending on the +specific nature of the task, e.g. background vs interactive vs low-priority. + +More details are given in section 5. + +2.1 Boosting +============ + +The boost value is expressed as an integer in the range [-100..0..100]. + +A value of 0 (default) configures the CFS scheduler for maximum energy +efficiency. This means that sched-DVFS runs the tasks at the minimum OPP +required to satisfy their workload demand. + +A value of 100 configures scheduler for maximum performance, which translates +to the selection of the maximum OPP on that CPU. + +A value of -100 configures scheduler for minimum performance, which translates +to the selection of the minimum OPP on that CPU. + +The range between -100, 0 and 100 can be set to satisfy other scenarios suitably. +For example to satisfy interactive response or depending on other system events +(battery level etc). + +The overall design of the SchedTune module is built on top of "Per-Entity Load +Tracking" (PELT) signals and sched-DVFS by introducing a bias on the Operating +Performance Point (OPP) selection. + +Each time a task is allocated on a CPU, cpufreq is given the opportunity to tune +the operating frequency of that CPU to better match the workload demand. The +selection of the actual OPP being activated is influenced by the boost value +for the task CGroup. + +This simple biasing approach leverages existing frameworks, which means minimal +modifications to the scheduler, and yet it allows to achieve a range of +different behaviours all from a single simple tunable knob. + +In EAS schedulers, we use boosted task and CPU utilization for energy +calculation and energy-aware task placement. + +2.2 prefer_idle +=============== + +This is a flag which indicates to the scheduler that userspace would like +the scheduler to focus on energy or to focus on performance. + +A value of 0 (default) signals to the CFS scheduler that tasks in this group +can be placed according to the energy-aware wakeup strategy. + +A value of 1 signals to the CFS scheduler that tasks in this group should be +placed to minimise wakeup latency. + +The value is combined with the boost value - task placement will not be +boost aware however CPU OPP selection is still boost aware. + +Android platforms typically use this flag for application tasks which the +user is currently interacting with. + + +3. Signal Boosting Strategy +=========================== + +The whole PELT machinery works based on the value of a few load tracking signals +which basically track the CPU bandwidth requirements for tasks and the capacity +of CPUs. The basic idea behind the SchedTune knob is to artificially inflate +some of these load tracking signals to make a task or RQ appears more demanding +that it actually is. + +Which signals have to be inflated depends on the specific "consumer". However, +independently from the specific (signal, consumer) pair, it is important to +define a simple and possibly consistent strategy for the concept of boosting a +signal. + +A boosting strategy defines how the "abstract" user-space defined +sched_cfs_boost value is translated into an internal "margin" value to be added +to a signal to get its inflated value: + + margin := boosting_strategy(sched_cfs_boost, signal) + boosted_signal := signal + margin + +Different boosting strategies were identified and analyzed before selecting the +one found to be most effective. + +Signal Proportional Compensation (SPC) +-------------------------------------- + +In this boosting strategy the sched_cfs_boost value is used to compute a +margin which is proportional to the complement of the original signal. +When a signal has a maximum possible value, its complement is defined as +the delta from the actual value and its possible maximum. + +Since the tunable implementation uses signals which have SCHED_LOAD_SCALE as +the maximum possible value, the margin becomes: + + margin := sched_cfs_boost * (SCHED_LOAD_SCALE - signal) + +Using this boosting strategy: +- a 100% sched_cfs_boost means that the signal is scaled to the maximum value +- each value in the range of sched_cfs_boost effectively inflates the signal in + question by a quantity which is proportional to the maximum value. + +For example, by applying the SPC boosting strategy to the selection of the OPP +to run a task it is possible to achieve these behaviors: + +- 0% boosting: run the task at the minimum OPP required by its workload +- 100% boosting: run the task at the maximum OPP available for the CPU +- 50% boosting: run at the half-way OPP between minimum and maximum + +Which means that, at 50% boosting, a task will be scheduled to run at half of +the maximum theoretically achievable performance on the specific target +platform. + +A graphical representation of an SPC boosted signal is represented in the +following figure where: + a) "-" represents the original signal + b) "b" represents a 50% boosted signal + c) "p" represents a 100% boosted signal + + + ^ + | SCHED_LOAD_SCALE + +-----------------------------------------------------------------+ + |pppppppppppppppppppppppppppppppppppppppppppppppppppppppppppppppppp + | + | boosted_signal + | bbbbbbbbbbbbbbbbbbbbbbbb + | + | original signal + | bbbbbbbbbbbbbbbbbbbbbbbb+----------------------+ + | | + |bbbbbbbbbbbbbbbbbb | + | | + | | + | | + | +-----------------------+ + | | + | | + | | + |------------------+ + | + | + +-----------------------------------------------------------------------> + +The plot above shows a ramped load signal (titled 'original_signal') and it's +boosted equivalent. For each step of the original signal the boosted signal +corresponding to a 50% boost is midway from the original signal and the upper +bound. Boosting by 100% generates a boosted signal which is always saturated to +the upper bound. + + +4. OPP selection using boosted CPU utilization +============================================== + +It is worth calling out that the implementation does not introduce any new load +signals. Instead, it provides an API to tune existing signals. This tuning is +done on demand and only in scheduler code paths where it is sensible to do so. +The new API calls are defined to return either the default signal or a boosted +one, depending on the value of sched_cfs_boost. This is a clean an non invasive +modification of the existing existing code paths. + +The signal representing a CPU's utilization is boosted according to the +previously described SPC boosting strategy. To sched-DVFS, this allows a CPU +(ie CFS run-queue) to appear more used then it actually is. + +Thus, with the sched_cfs_boost enabled we have the following main functions to +get the current utilization of a CPU: + + cpu_util() + boosted_cpu_util() + +The new boosted_cpu_util() is similar to the first but returns a boosted +utilization signal which is a function of the sched_cfs_boost value. + +This function is used in the CFS scheduler code paths where sched-DVFS needs to +decide the OPP to run a CPU at. +For example, this allows selecting the highest OPP for a CPU which has +the boost value set to 100%. + + +5. Per task group boosting +========================== + +On battery powered devices there usually are many background services which are +long running and need energy efficient scheduling. On the other hand, some +applications are more performance sensitive and require an interactive +response and/or maximum performance, regardless of the energy cost. + +To better service such scenarios, the SchedTune implementation has an extension +that provides a more fine grained boosting interface. + +A new CGroup controller, namely "schedtune", can be enabled which allows to +defined and configure task groups with different boosting values. +Tasks that require special performance can be put into separate CGroups. +The value of the boost associated with the tasks in this group can be specified +using a single knob exposed by the CGroup controller: + + schedtune.boost + +This knob allows the definition of a boost value that is to be used for +SPC boosting of all tasks attached to this group. + +The current schedtune controller implementation is really simple and has these +main characteristics: + + 1) It is only possible to create 1 level depth hierarchies + + The root control groups define the system-wide boost value to be applied + by default to all tasks. Its direct subgroups are named "boost groups" and + they define the boost value for specific set of tasks. + Further nested subgroups are not allowed since they do not have a sensible + meaning from a user-space standpoint. + + 2) It is possible to define only a limited number of "boost groups" + + This number is defined at compile time and by default configured to 16. + This is a design decision motivated by two main reasons: + a) In a real system we do not expect utilization scenarios with more then few + boost groups. For example, a reasonable collection of groups could be + just "background", "interactive" and "performance". + b) It simplifies the implementation considerably, especially for the code + which has to compute the per CPU boosting once there are multiple + RUNNABLE tasks with different boost values. + +Such a simple design should allow servicing the main utilization scenarios identified +so far. It provides a simple interface which can be used to manage the +power-performance of all tasks or only selected tasks. +Moreover, this interface can be easily integrated by user-space run-times (e.g. +Android, ChromeOS) to implement a QoS solution for task boosting based on tasks +classification, which has been a long standing requirement. + +Setup and usage +--------------- + +0. Use a kernel with CONFIG_SCHED_TUNE support enabled + +1. Check that the "schedtune" CGroup controller is available: + + root@linaro-nano:~# cat /proc/cgroups + #subsys_name hierarchy num_cgroups enabled + cpuset 0 1 1 + cpu 0 1 1 + schedtune 0 1 1 + +2. Mount a tmpfs to create the CGroups mount point (Optional) + + root@linaro-nano:~# sudo mount -t tmpfs cgroups /sys/fs/cgroup + +3. Mount the "schedtune" controller + + root@linaro-nano:~# mkdir /sys/fs/cgroup/stune + root@linaro-nano:~# sudo mount -t cgroup -o schedtune stune /sys/fs/cgroup/stune + +4. Create task groups and configure their specific boost value (Optional) + + For example here we create a "performance" boost group configure to boost + all its tasks to 100% + + root@linaro-nano:~# mkdir /sys/fs/cgroup/stune/performance + root@linaro-nano:~# echo 100 > /sys/fs/cgroup/stune/performance/schedtune.boost + +5. Move tasks into the boost group + + For example, the following moves the tasks with PID $TASKPID (and all its + threads) into the "performance" boost group. + + root@linaro-nano:~# echo "TASKPID > /sys/fs/cgroup/stune/performance/cgroup.procs + +This simple configuration allows only the threads of the $TASKPID task to run, +when needed, at the highest OPP in the most capable CPU of the system. + + +6. Per-task wakeup-placement-strategy Selection +=============================================== + +Many devices have a number of CFS tasks in use which require an absolute +minimum wakeup latency, and many tasks for which wakeup latency is not +important. + +For touch-driven environments, removing additional wakeup latency can be +critical. + +When you use the Schedtume CGroup controller, you have access to a second +parameter which allows a group to be marked such that energy_aware task +placement is bypassed for tasks belonging to that group. + +prefer_idle=0 (default - use energy-aware task placement if available) +prefer_idle=1 (never use energy-aware task placement for these tasks) + +Since the regular wakeup task placement algorithm in CFS is biased for +performance, this has the effect of restoring minimum wakeup latency +for the desired tasks whilst still allowing energy-aware wakeup placement +to save energy for other tasks. + + +7. Question and Answers +======================= + +What about "auto" mode? +----------------------- + +The 'auto' mode as described in [5] can be implemented by interfacing SchedTune +with some suitable user-space element. This element could use the exposed +system-wide or cgroup based interface. + +How are multiple groups of tasks with different boost values managed? +--------------------------------------------------------------------- + +The current SchedTune implementation keeps track of the boosted RUNNABLE tasks +on a CPU. The CPU utilization seen by the scheduler-driven cpufreq governors +(and used to select an appropriate OPP) is boosted with a value which is the +maximum of the boost values of the currently RUNNABLE tasks in its RQ. + +This allows cpufreq to boost a CPU only while there are boosted tasks ready +to run and switch back to the energy efficient mode as soon as the last boosted +task is dequeued. + + +8. References +============= +[1] http://lwn.net/Articles/552889 +[2] http://lkml.org/lkml/2012/5/18/91 +[3] http://lkml.org/lkml/2015/6/26/620
diff --git a/arch/arm/boot/dts/vexpress-v2p-ca15_a7.dts b/arch/arm/boot/dts/vexpress-v2p-ca15_a7.dts index a4c7713..c5c365f 100644 --- a/arch/arm/boot/dts/vexpress-v2p-ca15_a7.dts +++ b/arch/arm/boot/dts/vexpress-v2p-ca15_a7.dts
@@ -41,6 +41,7 @@ cci-control-port = <&cci_control1>; cpu-idle-states = <&CLUSTER_SLEEP_BIG>; capacity-dmips-mhz = <1024>; + sched-energy-costs = <&CPU_COST_A15 &CLUSTER_COST_A15>; }; cpu1: cpu@1 { @@ -50,6 +51,7 @@ cci-control-port = <&cci_control1>; cpu-idle-states = <&CLUSTER_SLEEP_BIG>; capacity-dmips-mhz = <1024>; + sched-energy-costs = <&CPU_COST_A15 &CLUSTER_COST_A15>; }; cpu2: cpu@2 { @@ -59,6 +61,7 @@ cci-control-port = <&cci_control2>; cpu-idle-states = <&CLUSTER_SLEEP_LITTLE>; capacity-dmips-mhz = <516>; + sched-energy-costs = <&CPU_COST_A7 &CLUSTER_COST_A7>; }; cpu3: cpu@3 { @@ -68,6 +71,7 @@ cci-control-port = <&cci_control2>; cpu-idle-states = <&CLUSTER_SLEEP_LITTLE>; capacity-dmips-mhz = <516>; + sched-energy-costs = <&CPU_COST_A7 &CLUSTER_COST_A7>; }; cpu4: cpu@4 { @@ -77,6 +81,7 @@ cci-control-port = <&cci_control2>; cpu-idle-states = <&CLUSTER_SLEEP_LITTLE>; capacity-dmips-mhz = <516>; + sched-energy-costs = <&CPU_COST_A7 &CLUSTER_COST_A7>; }; idle-states { @@ -96,6 +101,77 @@ min-residency-us = <2500>; }; }; + + energy-costs { + CPU_COST_A15: core-cost0 { + busy-cost-data = < + 426 2021 + 512 2312 + 597 2756 + 682 3125 + 768 3524 + 853 3846 + 938 5177 + 1024 6997 + >; + idle-cost-data = < + 0 + 0 + 0 + >; + }; + CPU_COST_A7: core-cost1 { + busy-cost-data = < + 150 187 + 172 275 + 215 334 + 258 407 + 301 447 + 344 549 + 387 761 + 430 1024 + >; + idle-cost-data = < + 0 + 0 + 0 + >; + }; + CLUSTER_COST_A15: cluster-cost0 { + busy-cost-data = < + 426 7920 + 512 8165 + 597 8172 + 682 8195 + 768 8265 + 853 8446 + 938 11426 + 1024 15200 + >; + idle-cost-data = < + 70 + 70 + 25 + >; + }; + CLUSTER_COST_A7: cluster-cost1 { + busy-cost-data = < + 150 2967 + 172 2792 + 215 2810 + 258 2815 + 301 2919 + 344 2847 + 387 3917 + 430 4905 + >; + idle-cost-data = < + 25 + 25 + 10 + >; + }; + }; }; memory@80000000 {
diff --git a/arch/arm/include/asm/topology.h b/arch/arm/include/asm/topology.h index f59ab9b..2a786f5 100644 --- a/arch/arm/include/asm/topology.h +++ b/arch/arm/include/asm/topology.h
@@ -25,6 +25,17 @@ void init_cpu_topology(void); void store_cpu_topology(unsigned int cpuid); const struct cpumask *cpu_coregroup_mask(int cpu); +#include <linux/arch_topology.h> + +/* Replace task scheduler's default frequency-invariant accounting */ +#define arch_scale_freq_capacity topology_get_freq_scale + +/* Replace task scheduler's default cpu-invariant accounting */ +#define arch_scale_cpu_capacity topology_get_cpu_scale + +/* Enable topology flag updates */ +#define arch_update_cpu_topology topology_update_cpu_topology + #else static inline void init_cpu_topology(void) { }
diff --git a/arch/arm/kernel/topology.c b/arch/arm/kernel/topology.c index 24ac3ca..a86e105 100644 --- a/arch/arm/kernel/topology.c +++ b/arch/arm/kernel/topology.c
@@ -25,11 +25,49 @@ #include <linux/sched/topology.h> #include <linux/slab.h> #include <linux/string.h> +#include <linux/sched_energy.h> #include <asm/cpu.h> #include <asm/cputype.h> #include <asm/topology.h> +/* sd energy functions */ +static inline +const struct sched_group_energy * const cpu_core_energy(int cpu) +{ + struct sched_group_energy *sge = sge_array[cpu][SD_LEVEL0]; + unsigned long capacity; + int max_cap_idx; + + if (!sge) { + pr_warn("Invalid sched_group_energy for CPU%d\n", cpu); + return NULL; + } + + max_cap_idx = sge->nr_cap_states - 1; + capacity = sge->cap_states[max_cap_idx].cap; + + printk_deferred("cpu=%d set cpu scale %lu from energy model\n", + cpu, capacity); + + topology_set_cpu_scale(cpu, capacity); + + return sge; +} + +static inline +const struct sched_group_energy * const cpu_cluster_energy(int cpu) +{ + struct sched_group_energy *sge = sge_array[cpu][SD_LEVEL1]; + + if (!sge) { + pr_warn("Invalid sched_group_energy for Cluster%d\n", cpu); + return NULL; + } + + return sge; +} + /* * cpu capacity scale management */ @@ -169,10 +207,26 @@ static void __init parse_dt_topology(void) */ static void update_cpu_capacity(unsigned int cpu) { - if (!cpu_capacity(cpu) || cap_from_dt) - return; + const struct sched_group_energy *sge; + unsigned long capacity; - topology_set_cpu_scale(cpu, cpu_capacity(cpu) / middle_capacity); + sge = cpu_core_energy(cpu); + + if (sge) { + int max_cap_idx; + + max_cap_idx = sge->nr_cap_states - 1; + capacity = sge->cap_states[max_cap_idx].cap; + + printk_deferred("cpu=%d set cpu scale %lu from energy model\n", + cpu, capacity); + } else { + if (!cpu_capacity(cpu) || cap_from_dt) + return; + capacity = cpu_capacity(cpu) / middle_capacity; + } + + topology_set_cpu_scale(cpu, capacity); pr_info("CPU%u: update cpu_capacity %lu\n", cpu, topology_get_cpu_scale(NULL, cpu)); @@ -278,23 +332,37 @@ void store_cpu_topology(unsigned int cpuid) update_cpu_capacity(cpuid); + topology_detect_flags(); + pr_info("CPU%u: thread %d, cpu %d, socket %d, mpidr %x\n", cpuid, cpu_topology[cpuid].thread_id, cpu_topology[cpuid].core_id, cpu_topology[cpuid].socket_id, mpidr); } +#ifdef CONFIG_SCHED_MC +static int core_flags(void) +{ + return cpu_core_flags() | topology_core_flags(); +} + static inline int cpu_corepower_flags(void) { - return SD_SHARE_PKG_RESOURCES | SD_SHARE_POWERDOMAIN; + return topology_core_flags() + | SD_SHARE_PKG_RESOURCES | SD_SHARE_POWERDOMAIN; +} +#endif + +static int cpu_flags(void) +{ + return topology_cpu_flags(); } static struct sched_domain_topology_level arm_topology[] = { #ifdef CONFIG_SCHED_MC - { cpu_corepower_mask, cpu_corepower_flags, SD_INIT_NAME(GMC) }, - { cpu_coregroup_mask, cpu_core_flags, SD_INIT_NAME(MC) }, + { cpu_coregroup_mask, core_flags, cpu_core_energy, SD_INIT_NAME(MC) }, #endif - { cpu_cpu_mask, SD_INIT_NAME(DIE) }, + { cpu_cpu_mask, cpu_flags, cpu_cluster_energy, SD_INIT_NAME(DIE) }, { NULL, }, }; @@ -322,4 +390,6 @@ void __init init_cpu_topology(void) /* Set scheduler topology descriptor */ set_sched_topology(arm_topology); + + init_sched_energy_costs(); }
diff --git a/arch/arm64/boot/dts/arm/juno-r2.dts b/arch/arm64/boot/dts/arm/juno-r2.dts index b39b6d6..d2467e4 100644 --- a/arch/arm64/boot/dts/arm/juno-r2.dts +++ b/arch/arm64/boot/dts/arm/juno-r2.dts
@@ -98,6 +98,7 @@ next-level-cache = <&A72_L2>; clocks = <&scpi_dvfs 0>; cpu-idle-states = <&CPU_SLEEP_0 &CLUSTER_SLEEP_0>; + sched-energy-costs = <&CPU_COST_A72 &CLUSTER_COST_A72>; capacity-dmips-mhz = <1024>; }; @@ -115,6 +116,7 @@ next-level-cache = <&A72_L2>; clocks = <&scpi_dvfs 0>; cpu-idle-states = <&CPU_SLEEP_0 &CLUSTER_SLEEP_0>; + sched-energy-costs = <&CPU_COST_A72 &CLUSTER_COST_A72>; capacity-dmips-mhz = <1024>; }; @@ -132,6 +134,7 @@ next-level-cache = <&A53_L2>; clocks = <&scpi_dvfs 1>; cpu-idle-states = <&CPU_SLEEP_0 &CLUSTER_SLEEP_0>; + sched-energy-costs = <&CPU_COST_A53R2 &CLUSTER_COST_A53R2>; capacity-dmips-mhz = <485>; }; @@ -149,6 +152,7 @@ next-level-cache = <&A53_L2>; clocks = <&scpi_dvfs 1>; cpu-idle-states = <&CPU_SLEEP_0 &CLUSTER_SLEEP_0>; + sched-energy-costs = <&CPU_COST_A53R2 &CLUSTER_COST_A53R2>; capacity-dmips-mhz = <485>; }; @@ -166,6 +170,7 @@ next-level-cache = <&A53_L2>; clocks = <&scpi_dvfs 1>; cpu-idle-states = <&CPU_SLEEP_0 &CLUSTER_SLEEP_0>; + sched-energy-costs = <&CPU_COST_A53R2 &CLUSTER_COST_A53R2>; capacity-dmips-mhz = <485>; }; @@ -183,6 +188,7 @@ next-level-cache = <&A53_L2>; clocks = <&scpi_dvfs 1>; cpu-idle-states = <&CPU_SLEEP_0 &CLUSTER_SLEEP_0>; + sched-energy-costs = <&CPU_COST_A53R2 &CLUSTER_COST_A53R2>; capacity-dmips-mhz = <485>; }; @@ -199,6 +205,7 @@ cache-line-size = <64>; cache-sets = <1024>; }; + /include/ "juno-sched-energy.dtsi" }; pmu_a72 {
diff --git a/arch/arm64/boot/dts/arm/juno-sched-energy.dtsi b/arch/arm64/boot/dts/arm/juno-sched-energy.dtsi new file mode 100644 index 0000000..7ffb255 --- /dev/null +++ b/arch/arm64/boot/dts/arm/juno-sched-energy.dtsi
@@ -0,0 +1,123 @@ +/* + * ARM JUNO specific energy cost model data. There are no unit requirements for + * the data. Data can be normalized to any reference point, but the + * normalization must be consistent. That is, one bogo-joule/watt must be the + * same quantity for all data, but we don't care what it is. + */ + +energy-costs { + /* Juno r0 Energy */ + CPU_COST_A57: core-cost0 { + busy-cost-data = < + 417 168 + 579 251 + 744 359 + 883 479 + 1024 616 + >; + idle-cost-data = < + 15 + 15 + 0 + 0 + >; + }; + CPU_COST_A53: core-cost1 { + busy-cost-data = < + 235 33 + 302 46 + 368 61 + 406 76 + 447 93 + >; + idle-cost-data = < + 6 + 6 + 0 + 0 + >; + }; + CLUSTER_COST_A57: cluster-cost0 { + busy-cost-data = < + 417 24 + 579 32 + 744 43 + 883 49 + 1024 64 + >; + idle-cost-data = < + 65 + 65 + 65 + 24 + >; + }; + CLUSTER_COST_A53: cluster-cost1 { + busy-cost-data = < + 235 26 + 303 30 + 368 39 + 406 47 + 447 57 + >; + idle-cost-data = < + 56 + 56 + 56 + 17 + >; + }; + /* Juno r2 Energy */ + CPU_COST_A72: core-cost2 { + busy-cost-data = < + 501 174 + 849 344 + 1024 526 + >; + idle-cost-data = < + 48 + 48 + 0 + 0 + >; + }; + CPU_COST_A53R2: core-cost3 { + busy-cost-data = < + 276 37 + 501 59 + 593 117 + >; + idle-cost-data = < + 33 + 33 + 0 + 0 + >; + }; + CLUSTER_COST_A72: cluster-cost2 { + busy-cost-data = < + 501 48 + 849 73 + 1024 107 + >; + idle-cost-data = < + 48 + 48 + 48 + 18 + >; + }; + CLUSTER_COST_A53R2: cluster-cost3 { + busy-cost-data = < + 276 41 + 501 86 + 593 107 + >; + idle-cost-data = < + 41 + 41 + 41 + 14 + >; + }; +};
diff --git a/arch/arm64/boot/dts/arm/juno.dts b/arch/arm64/boot/dts/arm/juno.dts index c9236c4..ae5306a 100644 --- a/arch/arm64/boot/dts/arm/juno.dts +++ b/arch/arm64/boot/dts/arm/juno.dts
@@ -97,6 +97,7 @@ next-level-cache = <&A57_L2>; clocks = <&scpi_dvfs 0>; cpu-idle-states = <&CPU_SLEEP_0 &CLUSTER_SLEEP_0>; + sched-energy-costs = <&CPU_COST_A57 &CLUSTER_COST_A57>; capacity-dmips-mhz = <1024>; }; @@ -114,6 +115,7 @@ next-level-cache = <&A57_L2>; clocks = <&scpi_dvfs 0>; cpu-idle-states = <&CPU_SLEEP_0 &CLUSTER_SLEEP_0>; + sched-energy-costs = <&CPU_COST_A57 &CLUSTER_COST_A57>; capacity-dmips-mhz = <1024>; }; @@ -131,6 +133,7 @@ next-level-cache = <&A53_L2>; clocks = <&scpi_dvfs 1>; cpu-idle-states = <&CPU_SLEEP_0 &CLUSTER_SLEEP_0>; + sched-energy-costs = <&CPU_COST_A53 &CLUSTER_COST_A53>; capacity-dmips-mhz = <578>; }; @@ -148,6 +151,7 @@ next-level-cache = <&A53_L2>; clocks = <&scpi_dvfs 1>; cpu-idle-states = <&CPU_SLEEP_0 &CLUSTER_SLEEP_0>; + sched-energy-costs = <&CPU_COST_A53 &CLUSTER_COST_A53>; capacity-dmips-mhz = <578>; }; @@ -165,6 +169,7 @@ next-level-cache = <&A53_L2>; clocks = <&scpi_dvfs 1>; cpu-idle-states = <&CPU_SLEEP_0 &CLUSTER_SLEEP_0>; + sched-energy-costs = <&CPU_COST_A53 &CLUSTER_COST_A53>; capacity-dmips-mhz = <578>; }; @@ -182,6 +187,7 @@ next-level-cache = <&A53_L2>; clocks = <&scpi_dvfs 1>; cpu-idle-states = <&CPU_SLEEP_0 &CLUSTER_SLEEP_0>; + sched-energy-costs = <&CPU_COST_A53 &CLUSTER_COST_A53>; capacity-dmips-mhz = <578>; }; @@ -198,6 +204,7 @@ cache-line-size = <64>; cache-sets = <1024>; }; + /include/ "juno-sched-energy.dtsi" }; pmu_a57 {
diff --git a/arch/arm64/boot/dts/hisilicon/hi6220.dtsi b/arch/arm64/boot/dts/hisilicon/hi6220.dtsi index ff1dc89..66d48e3 100644 --- a/arch/arm64/boot/dts/hisilicon/hi6220.dtsi +++ b/arch/arm64/boot/dts/hisilicon/hi6220.dtsi
@@ -92,7 +92,9 @@ cooling-max-level = <0>; #cooling-cells = <2>; /* min followed by max */ cpu-idle-states = <&CPU_SLEEP &CLUSTER_SLEEP>; + sched-energy-costs = <&CPU_COST &CLUSTER_COST &SYSTEM_COST>; dynamic-power-coefficient = <311>; + capacity-dmips-mhz = <1024>; }; cpu1: cpu@1 { @@ -103,6 +105,8 @@ next-level-cache = <&CLUSTER0_L2>; operating-points-v2 = <&cpu_opp_table>; cpu-idle-states = <&CPU_SLEEP &CLUSTER_SLEEP>; + sched-energy-costs = <&CPU_COST &CLUSTER_COST &SYSTEM_COST>; + capacity-dmips-mhz = <1024>; }; cpu2: cpu@2 { @@ -113,6 +117,7 @@ next-level-cache = <&CLUSTER0_L2>; operating-points-v2 = <&cpu_opp_table>; cpu-idle-states = <&CPU_SLEEP &CLUSTER_SLEEP>; + capacity-dmips-mhz = <1024>; }; cpu3: cpu@3 { @@ -123,6 +128,8 @@ next-level-cache = <&CLUSTER0_L2>; operating-points-v2 = <&cpu_opp_table>; cpu-idle-states = <&CPU_SLEEP &CLUSTER_SLEEP>; + sched-energy-costs = <&CPU_COST &CLUSTER_COST &SYSTEM_COST>; + capacity-dmips-mhz = <1024>; }; cpu4: cpu@100 { @@ -133,6 +140,7 @@ next-level-cache = <&CLUSTER1_L2>; operating-points-v2 = <&cpu_opp_table>; cpu-idle-states = <&CPU_SLEEP &CLUSTER_SLEEP>; + capacity-dmips-mhz = <1024>; }; cpu5: cpu@101 { @@ -143,6 +151,8 @@ next-level-cache = <&CLUSTER1_L2>; operating-points-v2 = <&cpu_opp_table>; cpu-idle-states = <&CPU_SLEEP &CLUSTER_SLEEP>; + sched-energy-costs = <&CPU_COST &CLUSTER_COST &SYSTEM_COST>; + capacity-dmips-mhz = <1024>; }; cpu6: cpu@102 { @@ -153,6 +163,8 @@ next-level-cache = <&CLUSTER1_L2>; operating-points-v2 = <&cpu_opp_table>; cpu-idle-states = <&CPU_SLEEP &CLUSTER_SLEEP>; + sched-energy-costs = <&CPU_COST &CLUSTER_COST &SYSTEM_COST>; + capacity-dmips-mhz = <1024>; }; cpu7: cpu@103 { @@ -163,6 +175,8 @@ next-level-cache = <&CLUSTER1_L2>; operating-points-v2 = <&cpu_opp_table>; cpu-idle-states = <&CPU_SLEEP &CLUSTER_SLEEP>; + sched-energy-costs = <&CPU_COST &CLUSTER_COST &SYSTEM_COST>; + capacity-dmips-mhz = <1024>; }; CLUSTER0_L2: l2-cache0 { @@ -172,6 +186,50 @@ CLUSTER1_L2: l2-cache1 { compatible = "cache"; }; + + energy-costs { + SYSTEM_COST: system-cost0 { + busy-cost-data = < + 1024 0 + >; + idle-cost-data = < + 0 + 0 + 0 + 0 + >; + }; + CLUSTER_COST: cluster-cost0 { + busy-cost-data = < + 178 16 + 369 29 + 622 47 + 819 75 + 1024 112 + >; + idle-cost-data = < + 107 + 107 + 47 + 0 + >; + }; + CPU_COST: core-cost0 { + busy-cost-data = < + 178 69 + 369 125 + 622 224 + 819 367 + 1024 670 + >; + idle-cost-data = < + 15 + 15 + 0 + 0 + >; + }; + }; }; cpu_opp_table: cpu_opp_table {
diff --git a/arch/arm64/configs/defconfig b/arch/arm64/configs/defconfig index 34480e9..36e2111 100644 --- a/arch/arm64/configs/defconfig +++ b/arch/arm64/configs/defconfig
@@ -4,6 +4,7 @@ CONFIG_NO_HZ_IDLE=y CONFIG_HIGH_RES_TIMERS=y CONFIG_IRQ_TIME_ACCOUNTING=y +CONFIG_SCHED_WALT=y CONFIG_BSD_PROCESS_ACCT=y CONFIG_BSD_PROCESS_ACCT_V3=y CONFIG_TASKSTATS=y @@ -24,8 +25,9 @@ CONFIG_CGROUP_PERF=y CONFIG_USER_NS=y CONFIG_SCHED_AUTOGROUP=y +CONFIG_SCHED_TUNE=y +CONFIG_DEFAULT_USE_ENERGY_AWARE=y CONFIG_BLK_DEV_INITRD=y -CONFIG_KALLSYMS_ALL=y # CONFIG_COMPAT_BRK is not set CONFIG_PROFILING=y CONFIG_JUMP_LABEL=y @@ -69,13 +71,13 @@ CONFIG_PCI_LAYERSCAPE=y CONFIG_PCI_HISI=y CONFIG_PCIE_QCOM=y -CONFIG_PCIE_KIRIN=y CONFIG_PCIE_ARMADA_8K=y +CONFIG_PCIE_KIRIN=y CONFIG_PCI_AARDVARK=y CONFIG_PCIE_RCAR=y -CONFIG_PCIE_ROCKCHIP=m CONFIG_PCI_HOST_GENERIC=y CONFIG_PCI_XGENE=y +CONFIG_PCIE_ROCKCHIP=m CONFIG_ARM64_VA_BITS_48=y CONFIG_SCHED_MC=y CONFIG_NUMA=y @@ -93,6 +95,12 @@ CONFIG_WQ_POWER_EFFICIENT_DEFAULT=y CONFIG_ARM_CPUIDLE=y CONFIG_CPU_FREQ=y +CONFIG_CPU_FREQ_STAT=y +CONFIG_CPU_FREQ_DEFAULT_GOV_SCHEDUTIL=y +CONFIG_CPU_FREQ_GOV_POWERSAVE=y +CONFIG_CPU_FREQ_GOV_USERSPACE=y +CONFIG_CPU_FREQ_GOV_ONDEMAND=y +CONFIG_CPU_FREQ_GOV_CONSERVATIVE=y CONFIG_CPUFREQ_DT=y CONFIG_ARM_BIG_LITTLE_CPUFREQ=y CONFIG_ARM_SCPI_CPUFREQ=y @@ -140,11 +148,10 @@ CONFIG_BT_LEDS=y # CONFIG_BT_DEBUGFS is not set CONFIG_BT_HCIUART=m -CONFIG_BT_HCIUART_LL=y -CONFIG_CFG80211=m -CONFIG_MAC80211=m +CONFIG_CFG80211=y +CONFIG_MAC80211=y CONFIG_MAC80211_LEDS=y -CONFIG_RFKILL=m +CONFIG_RFKILL=y CONFIG_NET_9P=y CONFIG_NET_9P_VIRTIO=y CONFIG_UEVENT_HELPER_PATH="/sbin/hotplug" @@ -210,21 +217,16 @@ CONFIG_ROCKCHIP_PHY=y CONFIG_USB_PEGASUS=m CONFIG_USB_RTL8150=m -CONFIG_USB_RTL8152=m -CONFIG_USB_USBNET=m -CONFIG_USB_NET_DM9601=m -CONFIG_USB_NET_SR9800=m -CONFIG_USB_NET_SMSC75XX=m -CONFIG_USB_NET_SMSC95XX=m -CONFIG_USB_NET_PLUSB=m -CONFIG_USB_NET_MCS7830=m +CONFIG_USB_RTL8152=y CONFIG_BRCMFMAC=m +CONFIG_RTL_CARDS=m CONFIG_WL18XX=m CONFIG_WLCORE_SDIO=m +CONFIG_USB_NET_RNDIS_WLAN=y CONFIG_INPUT_EVDEV=y CONFIG_KEYBOARD_ADC=m -CONFIG_KEYBOARD_CROS_EC=y CONFIG_KEYBOARD_GPIO=y +CONFIG_KEYBOARD_CROS_EC=y CONFIG_INPUT_MISC=y CONFIG_INPUT_PM8941_PWRKEY=y CONFIG_INPUT_HISI_POWERKEY=y @@ -275,20 +277,20 @@ CONFIG_I2C_RCAR=y CONFIG_I2C_CROS_EC_TUNNEL=y CONFIG_SPI=y -CONFIG_SPI_MESON_SPICC=m -CONFIG_SPI_MESON_SPIFC=m CONFIG_SPI_BCM2835=m CONFIG_SPI_BCM2835AUX=m +CONFIG_SPI_MESON_SPICC=m +CONFIG_SPI_MESON_SPIFC=m CONFIG_SPI_ORION=y CONFIG_SPI_PL022=y -CONFIG_SPI_QUP=y CONFIG_SPI_ROCKCHIP=y +CONFIG_SPI_QUP=y CONFIG_SPI_S3C64XX=y CONFIG_SPI_SPIDEV=m CONFIG_SPMI=y -CONFIG_PINCTRL_IPQ8074=y CONFIG_PINCTRL_SINGLE=y CONFIG_PINCTRL_MAX77620=y +CONFIG_PINCTRL_IPQ8074=y CONFIG_PINCTRL_MSM8916=y CONFIG_PINCTRL_MSM8994=y CONFIG_PINCTRL_MSM8996=y @@ -313,9 +315,8 @@ CONFIG_THERMAL_GOV_POWER_ALLOCATOR=y CONFIG_CPU_THERMAL=y CONFIG_THERMAL_EMULATION=y -CONFIG_BRCMSTB_THERMAL=m -CONFIG_EXYNOS_THERMAL=y CONFIG_ROCKCHIP_THERMAL=m +CONFIG_EXYNOS_THERMAL=y CONFIG_WATCHDOG=y CONFIG_S3C2410_WATCHDOG=y CONFIG_MESON_GXBB_WATCHDOG=m @@ -334,9 +335,9 @@ CONFIG_MFD_SPMI_PMIC=y CONFIG_MFD_RK808=y CONFIG_MFD_SEC_CORE=y +CONFIG_REGULATOR_FIXED_VOLTAGE=y CONFIG_REGULATOR_AXP20X=y CONFIG_REGULATOR_FAN53555=y -CONFIG_REGULATOR_FIXED_VOLTAGE=y CONFIG_REGULATOR_GPIO=y CONFIG_REGULATOR_HI6421V530=y CONFIG_REGULATOR_HI655X=y @@ -346,16 +347,13 @@ CONFIG_REGULATOR_QCOM_SPMI=y CONFIG_REGULATOR_RK808=y CONFIG_REGULATOR_S2MPS11=y +CONFIG_RC_DEVICES=y +CONFIG_IR_MESON=m CONFIG_MEDIA_SUPPORT=m CONFIG_MEDIA_CAMERA_SUPPORT=y CONFIG_MEDIA_ANALOG_TV_SUPPORT=y CONFIG_MEDIA_DIGITAL_TV_SUPPORT=y CONFIG_MEDIA_CONTROLLER=y -CONFIG_MEDIA_RC_SUPPORT=y -CONFIG_RC_CORE=m -CONFIG_RC_DEVICES=y -CONFIG_RC_DECODERS=y -CONFIG_IR_MESON=m CONFIG_VIDEO_V4L2_SUBDEV_API=y # CONFIG_DVB_NET is not set CONFIG_V4L_MEM2MEM_DRIVERS=y @@ -393,7 +391,6 @@ CONFIG_BACKLIGHT_GENERIC=m CONFIG_BACKLIGHT_PWM=m CONFIG_BACKLIGHT_LP855X=m -CONFIG_FRAMEBUFFER_CONSOLE=y CONFIG_LOGO=y # CONFIG_LOGO_LINUX_MONO is not set # CONFIG_LOGO_LINUX_VGA16 is not set @@ -492,7 +489,6 @@ CONFIG_COMMON_CLK_RK808=y CONFIG_COMMON_CLK_SCPI=y CONFIG_COMMON_CLK_CS2000_CP=y -CONFIG_COMMON_CLK_S2MPS11=y CONFIG_CLK_QORIQ=y CONFIG_COMMON_CLK_PWM=y CONFIG_COMMON_CLK_QCOM=y @@ -531,13 +527,13 @@ CONFIG_PWM_ROCKCHIP=y CONFIG_PWM_SAMSUNG=y CONFIG_PWM_TEGRA=m -CONFIG_PHY_RCAR_GEN3_USB2=y -CONFIG_PHY_HI6220_USB=y -CONFIG_PHY_SUN4I_USB=y -CONFIG_PHY_ROCKCHIP_INNO_USB2=y -CONFIG_PHY_ROCKCHIP_EMMC=y -CONFIG_PHY_ROCKCHIP_PCIE=m CONFIG_PHY_XGENE=y +CONFIG_PHY_SUN4I_USB=y +CONFIG_PHY_HI6220_USB=y +CONFIG_PHY_RCAR_GEN3_USB2=y +CONFIG_PHY_ROCKCHIP_EMMC=y +CONFIG_PHY_ROCKCHIP_INNO_USB2=y +CONFIG_PHY_ROCKCHIP_PCIE=m CONFIG_PHY_TEGRA_XUSB=y CONFIG_QCOM_L2_PMU=y CONFIG_QCOM_L3_PMU=y @@ -579,29 +575,27 @@ CONFIG_KVM=y CONFIG_PRINTK_TIME=y CONFIG_DEBUG_INFO=y -CONFIG_DEBUG_FS=y CONFIG_MAGIC_SYSRQ=y CONFIG_DEBUG_KERNEL=y -CONFIG_LOCKUP_DETECTOR=y -# CONFIG_SCHED_DEBUG is not set +CONFIG_SCHEDSTATS=y # CONFIG_DEBUG_PREEMPT is not set -# CONFIG_FTRACE is not set +CONFIG_PROVE_LOCKING=y +CONFIG_FUNCTION_TRACER=y +CONFIG_IRQSOFF_TRACER=y +CONFIG_PREEMPT_TRACER=y +CONFIG_SCHED_TRACER=y CONFIG_MEMTEST=y CONFIG_SECURITY=y CONFIG_CRYPTO_ECHAINIV=y CONFIG_CRYPTO_ANSI_CPRNG=y CONFIG_ARM64_CRYPTO=y -CONFIG_CRYPTO_SHA256_ARM64=m CONFIG_CRYPTO_SHA512_ARM64=m CONFIG_CRYPTO_SHA1_ARM64_CE=y CONFIG_CRYPTO_SHA2_ARM64_CE=y CONFIG_CRYPTO_GHASH_ARM64_CE=y CONFIG_CRYPTO_CRCT10DIF_ARM64_CE=m CONFIG_CRYPTO_CRC32_ARM64_CE=m -CONFIG_CRYPTO_AES_ARM64=m -CONFIG_CRYPTO_AES_ARM64_CE=m CONFIG_CRYPTO_AES_ARM64_CE_CCM=y CONFIG_CRYPTO_AES_ARM64_CE_BLK=y -CONFIG_CRYPTO_AES_ARM64_NEON_BLK=m CONFIG_CRYPTO_CHACHA20_NEON=m CONFIG_CRYPTO_AES_ARM64_BS=m
diff --git a/arch/arm64/include/asm/topology.h b/arch/arm64/include/asm/topology.h index b320228..13a00ec 100644 --- a/arch/arm64/include/asm/topology.h +++ b/arch/arm64/include/asm/topology.h
@@ -33,6 +33,17 @@ int pcibus_to_node(struct pci_bus *bus); #endif /* CONFIG_NUMA */ +#include <linux/arch_topology.h> + +/* Replace task scheduler's default frequency-invariant accounting */ +#define arch_scale_freq_capacity topology_get_freq_scale + +/* Replace task scheduler's default cpu-invariant accounting */ +#define arch_scale_cpu_capacity topology_get_cpu_scale + +/* Enable topology flag updates */ +#define arch_update_cpu_topology topology_update_cpu_topology + #include <asm-generic/topology.h> #endif /* _ASM_ARM_TOPOLOGY_H */
diff --git a/arch/arm64/kernel/topology.c b/arch/arm64/kernel/topology.c index 8d48b23..35dbdfa 100644 --- a/arch/arm64/kernel/topology.c +++ b/arch/arm64/kernel/topology.c
@@ -21,6 +21,7 @@ #include <linux/of.h> #include <linux/sched.h> #include <linux/sched/topology.h> +#include <linux/sched_energy.h> #include <linux/slab.h> #include <linux/string.h> @@ -187,6 +188,8 @@ static int __init parse_dt_topology(void) if (!map) goto out; + init_sched_energy_costs(); + ret = parse_cluster(map, 0); if (ret != 0) goto out_map; @@ -280,8 +283,89 @@ void store_cpu_topology(unsigned int cpuid) topology_populated: update_siblings_masks(cpuid); + topology_detect_flags(); } +#ifdef CONFIG_SCHED_SMT +static int smt_flags(void) +{ + return cpu_smt_flags() | topology_smt_flags(); +} +#endif + +#ifdef CONFIG_SCHED_MC +static int core_flags(void) +{ + return cpu_core_flags() | topology_core_flags(); +} +#endif + +static int cpu_flags(void) +{ + return topology_cpu_flags(); +} + +static inline +const struct sched_group_energy * const cpu_core_energy(int cpu) +{ + struct sched_group_energy *sge = sge_array[cpu][SD_LEVEL0]; + unsigned long capacity; + int max_cap_idx; + + if (!sge) { + pr_warn("Invalid sched_group_energy for CPU%d\n", cpu); + return NULL; + } + + max_cap_idx = sge->nr_cap_states - 1; + capacity = sge->cap_states[max_cap_idx].cap; + + printk_deferred("cpu=%d set cpu scale %lu from energy model\n", + cpu, capacity); + + topology_set_cpu_scale(cpu, capacity); + + return sge; +} + +static inline +const struct sched_group_energy * const cpu_cluster_energy(int cpu) +{ + struct sched_group_energy *sge = sge_array[cpu][SD_LEVEL1]; + + if (!sge) { + pr_warn("Invalid sched_group_energy for Cluster%d\n", cpu); + return NULL; + } + + return sge; +} + +static inline +const struct sched_group_energy * const cpu_system_energy(int cpu) +{ + struct sched_group_energy *sge = sge_array[cpu][SD_LEVEL2]; + + if (!sge) { + pr_warn("Invalid sched_group_energy for System%d\n", cpu); + return NULL; + } + + return sge; +} + +static struct sched_domain_topology_level arm64_topology[] = { +#ifdef CONFIG_SCHED_SMT + { cpu_smt_mask, smt_flags, SD_INIT_NAME(SMT) }, +#endif +#ifdef CONFIG_SCHED_MC + { cpu_coregroup_mask, core_flags, cpu_core_energy, SD_INIT_NAME(MC) }, +#endif + { cpu_cpu_mask, cpu_flags, cpu_cluster_energy, SD_INIT_NAME(DIE) }, + { cpu_cpu_mask, NULL, cpu_system_energy, SD_INIT_NAME(SYS) }, + { NULL, } +}; + static void __init reset_cpu_topology(void) { unsigned int cpu; @@ -310,4 +394,6 @@ void __init init_cpu_topology(void) */ if (of_have_populated_dt() && parse_dt_topology()) reset_cpu_topology(); + else + set_sched_topology(arm64_topology); }
diff --git a/drivers/base/arch_topology.c b/drivers/base/arch_topology.c index 6df7d66..cf92aa8 100644 --- a/drivers/base/arch_topology.c +++ b/drivers/base/arch_topology.c
@@ -21,14 +21,25 @@ #include <linux/slab.h> #include <linux/string.h> #include <linux/sched/topology.h> +#include <linux/cpuset.h> +#include <linux/sched_energy.h> + +DEFINE_PER_CPU(unsigned long, freq_scale) = SCHED_CAPACITY_SCALE; + +void arch_set_freq_scale(struct cpumask *cpus, unsigned long cur_freq, + unsigned long max_freq) +{ + unsigned long scale; + int i; + + scale = (cur_freq << SCHED_CAPACITY_SHIFT) / max_freq; + + for_each_cpu(i, cpus) + per_cpu(freq_scale, i) = scale; +} static DEFINE_MUTEX(cpu_scale_mutex); -static DEFINE_PER_CPU(unsigned long, cpu_scale) = SCHED_CAPACITY_SCALE; - -unsigned long topology_get_cpu_scale(struct sched_domain *sd, int cpu) -{ - return per_cpu(cpu_scale, cpu); -} +DEFINE_PER_CPU(unsigned long, cpu_scale) = SCHED_CAPACITY_SCALE; void topology_set_cpu_scale(unsigned int cpu, unsigned long capacity) { @@ -44,6 +55,9 @@ static ssize_t cpu_capacity_show(struct device *dev, return sprintf(buf, "%lu\n", topology_get_cpu_scale(NULL, cpu->dev.id)); } +static void update_topology_flags_workfn(struct work_struct *work); +static DECLARE_WORK(update_topology_flags_work, update_topology_flags_workfn); + static ssize_t cpu_capacity_store(struct device *dev, struct device_attribute *attr, const char *buf, @@ -54,10 +68,15 @@ static ssize_t cpu_capacity_store(struct device *dev, int i; unsigned long new_capacity; ssize_t ret; + cpumask_var_t mask; if (!count) return 0; + /* don't allow changes if sched-group-energy is installed */ + if(sched_energy_installed(this_cpu)) + return -EINVAL; + ret = kstrtoul(buf, 0, &new_capacity); if (ret) return ret; @@ -65,10 +84,41 @@ static ssize_t cpu_capacity_store(struct device *dev, return -EINVAL; mutex_lock(&cpu_scale_mutex); - for_each_cpu(i, &cpu_topology[this_cpu].core_sibling) + + if (new_capacity < SCHED_CAPACITY_SCALE) { + int highest_score_cpu = 0; + + if (!alloc_cpumask_var(&mask, GFP_KERNEL)) { + mutex_unlock(&cpu_scale_mutex); + return -ENOMEM; + } + + cpumask_andnot(mask, cpu_online_mask, + topology_core_cpumask(this_cpu)); + + for_each_cpu(i, mask) { + if (topology_get_cpu_scale(NULL, i) == + SCHED_CAPACITY_SCALE) { + highest_score_cpu = 1; + break; + } + } + + free_cpumask_var(mask); + + if (!highest_score_cpu) { + mutex_unlock(&cpu_scale_mutex); + return -EINVAL; + } + } + + for_each_cpu(i, topology_core_cpumask(this_cpu)) topology_set_cpu_scale(i, new_capacity); mutex_unlock(&cpu_scale_mutex); + if (topology_detect_flags()) + schedule_work(&update_topology_flags_work); + return count; } @@ -93,6 +143,185 @@ static int register_cpu_capacity_sysctl(void) } subsys_initcall(register_cpu_capacity_sysctl); +enum asym_cpucap_type { no_asym, asym_thread, asym_core, asym_die }; +static enum asym_cpucap_type asym_cpucap = no_asym; +enum share_cap_type { no_share_cap, share_cap_thread, share_cap_core, share_cap_die}; +static enum share_cap_type share_cap = no_share_cap; + +#ifdef CONFIG_CPU_FREQ +int detect_share_cap_flag(void) +{ + int cpu; + enum share_cap_type share_cap_level = no_share_cap; + struct cpufreq_policy *policy; + + for_each_possible_cpu(cpu) { + policy = cpufreq_cpu_get(cpu); + + if (!policy) + return 0; + + if (cpumask_equal(topology_sibling_cpumask(cpu), + policy->related_cpus)) { + share_cap_level = share_cap_thread; + continue; + } + + if (cpumask_equal(topology_core_cpumask(cpu), + policy->related_cpus)) { + share_cap_level = share_cap_core; + continue; + } + + if (cpumask_equal(cpu_cpu_mask(cpu), + policy->related_cpus)) { + share_cap_level = share_cap_die; + continue; + } + } + + if (share_cap != share_cap_level) { + share_cap = share_cap_level; + return 1; + } + + return 0; +} +#else +int detect_share_cap_flag(void) { return 0; } +#endif + +/* + * Walk cpu topology to determine sched_domain flags. + * + * SD_ASYM_CPUCAPACITY: Indicates the lowest level that spans all cpu + * capacities found in the system for all cpus, i.e. the flag is set + * at the same level for all systems. The current algorithm implements + * this by looking for higher capacities, which doesn't work for all + * conceivable topology, but don't complicate things until it is + * necessary. + */ +int topology_detect_flags(void) +{ + unsigned long max_capacity, capacity; + enum asym_cpucap_type asym_level = no_asym; + int cpu, die_cpu, core, thread, flags_changed = 0; + + for_each_possible_cpu(cpu) { + max_capacity = 0; + + if (asym_level >= asym_thread) + goto check_core; + + for_each_cpu(thread, topology_sibling_cpumask(cpu)) { + capacity = topology_get_cpu_scale(NULL, thread); + + if (capacity > max_capacity) { + if (max_capacity != 0) + asym_level = asym_thread; + + max_capacity = capacity; + } + } + +check_core: + if (asym_level >= asym_core) + goto check_die; + + for_each_cpu(core, topology_core_cpumask(cpu)) { + capacity = topology_get_cpu_scale(NULL, core); + + if (capacity > max_capacity) { + if (max_capacity != 0) + asym_level = asym_core; + + max_capacity = capacity; + } + } +check_die: + for_each_possible_cpu(die_cpu) { + capacity = topology_get_cpu_scale(NULL, die_cpu); + + if (capacity > max_capacity) { + if (max_capacity != 0) { + asym_level = asym_die; + goto done; + } + } + } + } + +done: + if (asym_cpucap != asym_level) { + asym_cpucap = asym_level; + flags_changed = 1; + pr_debug("topology flag change detected\n"); + } + + if (detect_share_cap_flag()) + flags_changed = 1; + + return flags_changed; +} + +int topology_smt_flags(void) +{ + int flags = 0; + + if (asym_cpucap == asym_thread) + flags |= SD_ASYM_CPUCAPACITY; + + if (share_cap == share_cap_thread) + flags |= SD_SHARE_CAP_STATES; + + return flags; +} + +int topology_core_flags(void) +{ + int flags = 0; + + if (asym_cpucap == asym_core) + flags |= SD_ASYM_CPUCAPACITY; + + if (share_cap == share_cap_core) + flags |= SD_SHARE_CAP_STATES; + + return flags; +} + +int topology_cpu_flags(void) +{ + int flags = 0; + + if (asym_cpucap == asym_die) + flags |= SD_ASYM_CPUCAPACITY; + + if (share_cap == share_cap_die) + flags |= SD_SHARE_CAP_STATES; + + return flags; +} + +static int update_topology = 0; + +int topology_update_cpu_topology(void) +{ + return update_topology; +} + +/* + * Updating the sched_domains can't be done directly from cpufreq callbacks + * due to locking, so queue the work for later. + */ +static void update_topology_flags_workfn(struct work_struct *work) +{ + update_topology = 1; + rebuild_sched_domains(); + pr_debug("sched_domain hierarchy rebuilt, flags updated\n"); + update_topology = 0; +} + static u32 capacity_scale; static u32 *raw_capacity; @@ -115,13 +344,12 @@ void topology_normalize_cpu_scale(void) pr_debug("cpu_capacity: capacity_scale=%u\n", capacity_scale); mutex_lock(&cpu_scale_mutex); for_each_possible_cpu(cpu) { - pr_debug("cpu_capacity: cpu=%d raw_capacity=%u\n", - cpu, raw_capacity[cpu]); capacity = (raw_capacity[cpu] << SCHED_CAPACITY_SHIFT) / capacity_scale; topology_set_cpu_scale(cpu, capacity); - pr_debug("cpu_capacity: CPU%d cpu_capacity=%lu\n", - cpu, topology_get_cpu_scale(NULL, cpu)); + pr_debug("cpu_capacity: CPU%d cpu_capacity=%lu raw_capacity=%u\n", + cpu, topology_get_cpu_scale(NULL, cpu), + raw_capacity[cpu]); } mutex_unlock(&cpu_scale_mutex); } @@ -129,14 +357,19 @@ void topology_normalize_cpu_scale(void) bool __init topology_parse_cpu_capacity(struct device_node *cpu_node, int cpu) { static bool cap_parsing_failed; - int ret; + int ret = 0; u32 cpu_capacity; if (cap_parsing_failed) return false; - ret = of_property_read_u32(cpu_node, "capacity-dmips-mhz", + /* override capacity-dmips-mhz if we have sched-energy-costs */ + if (of_find_property(cpu_node, "sched-energy-costs", NULL)) + cpu_capacity = topology_get_cpu_scale(NULL, cpu); + else + ret = of_property_read_u32(cpu_node, "capacity-dmips-mhz", &cpu_capacity); + if (!ret) { if (!raw_capacity) { raw_capacity = kcalloc(num_possible_cpus(), @@ -198,6 +431,8 @@ init_cpu_capacity_callback(struct notifier_block *nb, if (cpumask_empty(cpus_to_visit)) { topology_normalize_cpu_scale(); + if (topology_detect_flags()) + schedule_work(&update_topology_flags_work); free_raw_capacity(); pr_debug("cpu_capacity: parsing done\n"); schedule_work(&parsing_done_work); @@ -212,6 +447,8 @@ static struct notifier_block init_cpu_capacity_notifier __initdata = { static int __init register_cpufreq_notifier(void) { + int ret; + /* * on ACPI-based systems we need to use the default cpu capacity * until we have the necessary code to parse the cpu capacity, so @@ -227,8 +464,13 @@ static int __init register_cpufreq_notifier(void) cpumask_copy(cpus_to_visit, cpu_possible_mask); - return cpufreq_register_notifier(&init_cpu_capacity_notifier, - CPUFREQ_POLICY_NOTIFIER); + ret = cpufreq_register_notifier(&init_cpu_capacity_notifier, + CPUFREQ_POLICY_NOTIFIER); + + if (ret) + free_cpumask_var(cpus_to_visit); + + return ret; } core_initcall(register_cpufreq_notifier); @@ -236,6 +478,7 @@ static void __init parsing_done_workfn(struct work_struct *work) { cpufreq_unregister_notifier(&init_cpu_capacity_notifier, CPUFREQ_POLICY_NOTIFIER); + free_cpumask_var(cpus_to_visit); } #else
diff --git a/drivers/cpufreq/arm_big_little.c b/drivers/cpufreq/arm_big_little.c index 1750412..0c41ab3 100644 --- a/drivers/cpufreq/arm_big_little.c +++ b/drivers/cpufreq/arm_big_little.c
@@ -213,6 +213,7 @@ static int bL_cpufreq_set_target(struct cpufreq_policy *policy, { u32 cpu = policy->cpu, cur_cluster, new_cluster, actual_cluster; unsigned int freqs_new; + int ret; cur_cluster = cpu_to_cluster(cpu); new_cluster = actual_cluster = per_cpu(physical_cluster, cpu); @@ -229,7 +230,14 @@ static int bL_cpufreq_set_target(struct cpufreq_policy *policy, } } - return bL_cpufreq_set_rate(cpu, actual_cluster, new_cluster, freqs_new); + ret = bL_cpufreq_set_rate(cpu, actual_cluster, new_cluster, freqs_new); + + if (!ret) { + arch_set_freq_scale(policy->related_cpus, freqs_new, + policy->cpuinfo.max_freq); + } + + return ret; } static inline u32 get_table_count(struct cpufreq_frequency_table *table)
diff --git a/drivers/cpufreq/cpufreq-dt.c b/drivers/cpufreq/cpufreq-dt.c index d83ab94..545946a 100644 --- a/drivers/cpufreq/cpufreq-dt.c +++ b/drivers/cpufreq/cpufreq-dt.c
@@ -43,9 +43,17 @@ static struct freq_attr *cpufreq_dt_attr[] = { static int set_target(struct cpufreq_policy *policy, unsigned int index) { struct private_data *priv = policy->driver_data; + unsigned long freq = policy->freq_table[index].frequency; + int ret; - return dev_pm_opp_set_rate(priv->cpu_dev, - policy->freq_table[index].frequency * 1000); + ret = dev_pm_opp_set_rate(priv->cpu_dev, freq * 1000); + + if (!ret) { + arch_set_freq_scale(policy->related_cpus, freq, + policy->cpuinfo.max_freq); + } + + return ret; } /*
diff --git a/drivers/cpufreq/cpufreq.c b/drivers/cpufreq/cpufreq.c index 91e6bee..183e1ed 100644 --- a/drivers/cpufreq/cpufreq.c +++ b/drivers/cpufreq/cpufreq.c
@@ -2432,6 +2432,17 @@ int cpufreq_boost_enabled(void) EXPORT_SYMBOL_GPL(cpufreq_boost_enabled); /********************************************************************* + * FREQUENCY INVARIANT ACCOUNTING SUPPORT * + *********************************************************************/ + +__weak void arch_set_freq_scale(struct cpumask *cpus, + unsigned long cur_freq, + unsigned long max_freq) +{ +} +EXPORT_SYMBOL_GPL(arch_set_freq_scale); + +/********************************************************************* * REGISTER / UNREGISTER CPUFREQ DRIVER * *********************************************************************/ static enum cpuhp_state hp_online;
diff --git a/drivers/cpuidle/cpuidle.c b/drivers/cpuidle/cpuidle.c index 484cc89..21a1833e 100644 --- a/drivers/cpuidle/cpuidle.c +++ b/drivers/cpuidle/cpuidle.c
@@ -211,7 +211,7 @@ int cpuidle_enter_state(struct cpuidle_device *dev, struct cpuidle_driver *drv, } /* Take note of the planned idle state. */ - sched_idle_set_state(target_state); + sched_idle_set_state(target_state, index); trace_cpu_idle_rcuidle(index, dev->cpu); time_start = ns_to_ktime(local_clock()); @@ -225,7 +225,7 @@ int cpuidle_enter_state(struct cpuidle_device *dev, struct cpuidle_driver *drv, trace_cpu_idle_rcuidle(PWR_EVENT_EXIT, dev->cpu); /* The cpu is no longer idle or about to enter idle. */ - sched_idle_set_state(NULL); + sched_idle_set_state(NULL, -1); if (broadcast) { if (WARN_ON_ONCE(!irqs_disabled()))
diff --git a/include/linux/arch_topology.h b/include/linux/arch_topology.h index d4fcb0e..e7fe036 100644 --- a/include/linux/arch_topology.h +++ b/include/linux/arch_topology.h
@@ -6,15 +6,35 @@ #define _LINUX_ARCH_TOPOLOGY_H_ #include <linux/types.h> +#include <linux/percpu.h> void topology_normalize_cpu_scale(void); +int topology_detect_flags(void); +int topology_smt_flags(void); +int topology_core_flags(void); +int topology_cpu_flags(void); +int topology_update_cpu_topology(void); struct device_node; bool topology_parse_cpu_capacity(struct device_node *cpu_node, int cpu); +DECLARE_PER_CPU(unsigned long, cpu_scale); + struct sched_domain; -unsigned long topology_get_cpu_scale(struct sched_domain *sd, int cpu); +static inline +unsigned long topology_get_cpu_scale(struct sched_domain *sd, int cpu) +{ + return per_cpu(cpu_scale, cpu); +} void topology_set_cpu_scale(unsigned int cpu, unsigned long capacity); +DECLARE_PER_CPU(unsigned long, freq_scale); + +static inline +unsigned long topology_get_freq_scale(struct sched_domain *sd, int cpu) +{ + return per_cpu(freq_scale, cpu); +} + #endif /* _LINUX_ARCH_TOPOLOGY_H_ */
diff --git a/include/linux/cgroup_subsys.h b/include/linux/cgroup_subsys.h index acb77dcf..8996c09 100644 --- a/include/linux/cgroup_subsys.h +++ b/include/linux/cgroup_subsys.h
@@ -21,6 +21,10 @@ SUBSYS(cpu) SUBSYS(cpuacct) #endif +#if IS_ENABLED(CONFIG_SCHED_TUNE) +SUBSYS(schedtune) +#endif + #if IS_ENABLED(CONFIG_BLK_CGROUP) SUBSYS(io) #endif
diff --git a/include/linux/cpufreq.h b/include/linux/cpufreq.h index 537ff842..28734ee 100644 --- a/include/linux/cpufreq.h +++ b/include/linux/cpufreq.h
@@ -919,6 +919,9 @@ static inline bool policy_has_boost_freq(struct cpufreq_policy *policy) extern unsigned int arch_freq_get_on_cpu(int cpu); +extern void arch_set_freq_scale(struct cpumask *cpus, unsigned long cur_freq, + unsigned long max_freq); + /* the following are really really optional */ extern struct freq_attr cpufreq_freq_attr_scaling_available_freqs; extern struct freq_attr cpufreq_freq_attr_scaling_boost_freqs;
diff --git a/include/linux/cpuidle.h b/include/linux/cpuidle.h index 8f7788d..b7c12ee 100644 --- a/include/linux/cpuidle.h +++ b/include/linux/cpuidle.h
@@ -214,7 +214,7 @@ static inline void cpuidle_use_deepest_state(bool enable) #endif /* kernel/sched/idle.c */ -extern void sched_idle_set_state(struct cpuidle_state *idle_state); +extern void sched_idle_set_state(struct cpuidle_state *idle_state, int index); extern void default_idle_call(void); #ifdef CONFIG_ARCH_NEEDS_CPU_IDLE_COUPLED
diff --git a/include/linux/sched.h b/include/linux/sched.h index fdf74f2..30c35a2 100644 --- a/include/linux/sched.h +++ b/include/linux/sched.h
@@ -166,6 +166,15 @@ struct task_group; /* Task command name length: */ #define TASK_COMM_LEN 16 +enum task_event { + PUT_PREV_TASK = 0, + PICK_NEXT_TASK = 1, + TASK_WAKE = 2, + TASK_MIGRATE = 3, + TASK_UPDATE = 4, + IRQ_UPDATE = 5, +}; + extern cpumask_var_t cpu_isolated_map; extern void scheduler_tick(void); @@ -410,6 +419,41 @@ struct sched_entity { #endif }; +#ifdef CONFIG_SCHED_WALT +#define RAVG_HIST_SIZE_MAX 5 + +/* ravg represents frequency scaled cpu-demand of tasks */ +struct ravg { + /* + * 'mark_start' marks the beginning of an event (task waking up, task + * starting to execute, task being preempted) within a window + * + * 'sum' represents how runnable a task has been within current + * window. It incorporates both running time and wait time and is + * frequency scaled. + * + * 'sum_history' keeps track of history of 'sum' seen over previous + * RAVG_HIST_SIZE windows. Windows where task was entirely sleeping are + * ignored. + * + * 'demand' represents maximum sum seen over previous + * sysctl_sched_ravg_hist_size windows. 'demand' could drive frequency + * demand for tasks. + * + * 'curr_window' represents task's contribution to cpu busy time + * statistics (rq->curr_runnable_sum) in current window + * + * 'prev_window' represents task's contribution to cpu busy time + * statistics (rq->prev_runnable_sum) in previous window + */ + u64 mark_start; + u32 sum, demand; + u32 sum_history[RAVG_HIST_SIZE_MAX]; + u32 curr_window, prev_window; + u16 active_windows; +}; +#endif + struct sched_rt_entity { struct list_head run_list; unsigned long timeout; @@ -562,6 +606,16 @@ struct task_struct { const struct sched_class *sched_class; struct sched_entity se; struct sched_rt_entity rt; +#ifdef CONFIG_SCHED_WALT + struct ravg ravg; + /* + * 'init_load_pct' represents the initial task load assigned to children + * of this task + */ + u32 init_load_pct; + u64 last_sleep_ts; +#endif + #ifdef CONFIG_CGROUP_SCHED struct task_group *sched_task_group; #endif
diff --git a/include/linux/sched/cpufreq.h b/include/linux/sched/cpufreq.h index d1ad3d8..0b55834 100644 --- a/include/linux/sched/cpufreq.h +++ b/include/linux/sched/cpufreq.h
@@ -12,8 +12,6 @@ #define SCHED_CPUFREQ_DL (1U << 1) #define SCHED_CPUFREQ_IOWAIT (1U << 2) -#define SCHED_CPUFREQ_RT_DL (SCHED_CPUFREQ_RT | SCHED_CPUFREQ_DL) - #ifdef CONFIG_CPU_FREQ struct update_util_data { void (*func)(struct update_util_data *data, u64 time, unsigned int flags);
diff --git a/include/linux/sched/sysctl.h b/include/linux/sched/sysctl.h index d6a18a3..e076ff8 100644 --- a/include/linux/sched/sysctl.h +++ b/include/linux/sched/sysctl.h
@@ -21,8 +21,16 @@ enum { sysctl_hung_task_timeout_secs = 0 }; extern unsigned int sysctl_sched_latency; extern unsigned int sysctl_sched_min_granularity; +extern unsigned int sysctl_sched_sync_hint_enable; +extern unsigned int sysctl_sched_cstate_aware; extern unsigned int sysctl_sched_wakeup_granularity; extern unsigned int sysctl_sched_child_runs_first; +#ifdef CONFIG_SCHED_WALT +extern unsigned int sysctl_sched_use_walt_cpu_util; +extern unsigned int sysctl_sched_use_walt_task_util; +extern unsigned int sysctl_sched_walt_init_task_load_pct; +extern unsigned int sysctl_sched_walt_cpu_high_irqload; +#endif enum sched_tunable_scaling { SCHED_TUNABLESCALING_NONE,
diff --git a/include/linux/sched/topology.h b/include/linux/sched/topology.h index cf257c2e..e0161c3d 100644 --- a/include/linux/sched/topology.h +++ b/include/linux/sched/topology.h
@@ -26,6 +26,7 @@ #define SD_PREFER_SIBLING 0x1000 /* Prefer to place tasks in a sibling domain */ #define SD_OVERLAP 0x2000 /* sched_domains of this level overlap */ #define SD_NUMA 0x4000 /* cross-node balancing */ +#define SD_SHARE_CAP_STATES 0x8000 /* Domain members share capacity state */ /* * Increase resolution of cpu_capacity calculations @@ -66,12 +67,30 @@ struct sched_domain_attr { extern int sched_domain_level_max; +struct capacity_state { + unsigned long cap; /* compute capacity */ + unsigned long power; /* power consumption at this compute capacity */ +}; + +struct idle_state { + unsigned long power; /* power consumption in this idle state */ +}; + +struct sched_group_energy { + unsigned int nr_idle_states; /* number of idle states */ + struct idle_state *idle_states; /* ptr to idle state array */ + unsigned int nr_cap_states; /* number of capacity states */ + struct capacity_state *cap_states; /* ptr to capacity state array */ +}; + struct sched_group; struct sched_domain_shared { atomic_t ref; atomic_t nr_busy_cpus; int has_idle_cores; + + bool overutilized; }; struct sched_domain { @@ -173,6 +192,8 @@ bool cpus_share_cache(int this_cpu, int that_cpu); typedef const struct cpumask *(*sched_domain_mask_f)(int cpu); typedef int (*sched_domain_flags_f)(void); +typedef +const struct sched_group_energy * const(*sched_domain_energy_f)(int cpu); #define SDTL_OVERLAP 0x01 @@ -186,6 +207,7 @@ struct sd_data { struct sched_domain_topology_level { sched_domain_mask_f mask; sched_domain_flags_f sd_flags; + sched_domain_energy_f energy; int flags; int numa_level; struct sd_data data;
diff --git a/include/linux/sched/wake_q.h b/include/linux/sched/wake_q.h index 10b19a1..a3661e9 100644 --- a/include/linux/sched/wake_q.h +++ b/include/linux/sched/wake_q.h
@@ -34,6 +34,7 @@ struct wake_q_head { struct wake_q_node *first; struct wake_q_node **lastp; + int count; }; #define WAKE_Q_TAIL ((struct wake_q_node *) 0x01) @@ -45,6 +46,7 @@ static inline void wake_q_init(struct wake_q_head *head) { head->first = WAKE_Q_TAIL; head->lastp = &head->first; + head->count = 0; } extern void wake_q_add(struct wake_q_head *head,
diff --git a/include/linux/sched_energy.h b/include/linux/sched_energy.h new file mode 100644 index 0000000..83d7178 --- /dev/null +++ b/include/linux/sched_energy.h
@@ -0,0 +1,41 @@ +#ifndef _LINUX_SCHED_ENERGY_H +#define _LINUX_SCHED_ENERGY_H + +#include <linux/sched.h> +#include <linux/slab.h> + +/* + * There doesn't seem to be an NR_CPUS style max number of sched domain + * levels so here's an arbitrary constant one for the moment. + * + * The levels alluded to here correspond to entries in struct + * sched_domain_topology_level that are meant to be populated by arch + * specific code (topology.c). + */ +#define NR_SD_LEVELS 8 + +#define SD_LEVEL0 0 +#define SD_LEVEL1 1 +#define SD_LEVEL2 2 +#define SD_LEVEL3 3 +#define SD_LEVEL4 4 +#define SD_LEVEL5 5 +#define SD_LEVEL6 6 +#define SD_LEVEL7 7 + +/* + * Convenience macro for iterating through said sd levels. + */ +#define for_each_possible_sd_level(level) \ + for (level = 0; level < NR_SD_LEVELS; level++) + +extern struct sched_group_energy *sge_array[NR_CPUS][NR_SD_LEVELS]; + +#ifdef CONFIG_GENERIC_ARCH_TOPOLOGY +void init_sched_energy_costs(void); +int sched_energy_installed(int cpu); +#else +void init_sched_energy_costs(void) {} +#endif + +#endif
diff --git a/include/trace/events/sched.h b/include/trace/events/sched.h index c4cd9e2..c620ae9 100644 --- a/include/trace/events/sched.h +++ b/include/trace/events/sched.h
@@ -594,6 +594,575 @@ TRACE_EVENT(sched_wake_idle_without_ipi, TP_printk("cpu=%d", __entry->cpu) ); + +#ifdef CONFIG_SMP +#ifdef CREATE_TRACE_POINTS +static inline +int __trace_sched_cpu(struct cfs_rq *cfs_rq, struct sched_entity *se) +{ +#ifdef CONFIG_FAIR_GROUP_SCHED + struct rq *rq = cfs_rq ? cfs_rq->rq : NULL; +#else + struct rq *rq = cfs_rq ? container_of(cfs_rq, struct rq, cfs) : NULL; +#endif + return rq ? cpu_of(rq) + : task_cpu((container_of(se, struct task_struct, se))); +} + +static inline +int __trace_sched_path(struct cfs_rq *cfs_rq, char *path, int len) +{ +#ifdef CONFIG_FAIR_GROUP_SCHED + int l = path ? len : 0; + + if (cfs_rq && task_group_is_autogroup(cfs_rq->tg)) + return autogroup_path(cfs_rq->tg, path, l) + 1; + else if (cfs_rq && cfs_rq->tg->css.cgroup) + return cgroup_path(cfs_rq->tg->css.cgroup, path, l) + 1; +#endif + if (path) + strcpy(path, "(null)"); + + return strlen("(null)"); +} + +static inline +struct cfs_rq *__trace_sched_group_cfs_rq(struct sched_entity *se) +{ +#ifdef CONFIG_FAIR_GROUP_SCHED + return se->my_q; +#else + return NULL; +#endif +} +#endif /* CREATE_TRACE_POINTS */ + +/* + * Tracepoint for cfs_rq load tracking: + */ +TRACE_EVENT(sched_load_cfs_rq, + + TP_PROTO(struct cfs_rq *cfs_rq), + + TP_ARGS(cfs_rq), + + TP_STRUCT__entry( + __field( int, cpu ) + __dynamic_array(char, path, + __trace_sched_path(cfs_rq, NULL, 0) ) + __field( unsigned long, load ) + __field( unsigned long, util ) + ), + + TP_fast_assign( + __entry->cpu = __trace_sched_cpu(cfs_rq, NULL); + __trace_sched_path(cfs_rq, __get_dynamic_array(path), + __get_dynamic_array_len(path)); + __entry->load = cfs_rq->runnable_load_avg; + __entry->util = cfs_rq->avg.util_avg; + ), + + TP_printk("cpu=%d path=%s load=%lu util=%lu", __entry->cpu, + __get_str(path), __entry->load, __entry->util) +); + +/* + * Tracepoint for rt_rq load tracking: + */ +struct rt_rq; + +TRACE_EVENT(sched_load_rt_rq, + + TP_PROTO(int cpu, struct rt_rq *rt_rq), + + TP_ARGS(cpu, rt_rq), + + TP_STRUCT__entry( + __field( int, cpu ) + __field( unsigned long, util ) + ), + + TP_fast_assign( + __entry->cpu = cpu; + __entry->util = rt_rq->avg.util_avg; + ), + + TP_printk("cpu=%d util=%lu", __entry->cpu, + __entry->util) +); + +#ifdef CONFIG_SCHED_WALT +extern unsigned int sysctl_sched_use_walt_cpu_util; +extern unsigned int sysctl_sched_use_walt_task_util; +extern unsigned int walt_ravg_window; +extern bool walt_disabled; +#endif + +/* + * Tracepoint for accounting cpu root cfs_rq + */ +TRACE_EVENT(sched_load_avg_cpu, + + TP_PROTO(int cpu, struct cfs_rq *cfs_rq), + + TP_ARGS(cpu, cfs_rq), + + TP_STRUCT__entry( + __field( int, cpu ) + __field( unsigned long, load_avg ) + __field( unsigned long, util_avg ) + __field( unsigned long, util_avg_pelt ) + __field( unsigned long, util_avg_walt ) + ), + + TP_fast_assign( + __entry->cpu = cpu; + __entry->load_avg = cfs_rq->avg.load_avg; + __entry->util_avg = cfs_rq->avg.util_avg; + __entry->util_avg_pelt = cfs_rq->avg.util_avg; + __entry->util_avg_walt = 0; +#ifdef CONFIG_SCHED_WALT + __entry->util_avg_walt = + cpu_rq(cpu)->prev_runnable_sum << SCHED_CAPACITY_SHIFT; + do_div(__entry->util_avg_walt, walt_ravg_window); + if (!walt_disabled && sysctl_sched_use_walt_cpu_util) + __entry->util_avg = __entry->util_avg_walt; +#endif + ), + + TP_printk("cpu=%d load_avg=%lu util_avg=%lu " + "util_avg_pelt=%lu util_avg_walt=%lu", + __entry->cpu, __entry->load_avg, __entry->util_avg, + __entry->util_avg_pelt, __entry->util_avg_walt) +); + + +/* + * Tracepoint for sched_entity load tracking: + */ +TRACE_EVENT(sched_load_se, + + TP_PROTO(struct sched_entity *se), + + TP_ARGS(se), + + TP_STRUCT__entry( + __field( int, cpu ) + __dynamic_array(char, path, + __trace_sched_path(__trace_sched_group_cfs_rq(se), NULL, 0) ) + __array( char, comm, TASK_COMM_LEN ) + __field( pid_t, pid ) + __field( unsigned long, load ) + __field( unsigned long, util ) + __field( unsigned long, util_pelt ) + __field( unsigned long, util_walt ) + ), + + TP_fast_assign( + struct cfs_rq *gcfs_rq = __trace_sched_group_cfs_rq(se); + struct task_struct *p = gcfs_rq ? NULL + : container_of(se, struct task_struct, se); + + __entry->cpu = __trace_sched_cpu(gcfs_rq, se); + __trace_sched_path(gcfs_rq, __get_dynamic_array(path), + __get_dynamic_array_len(path)); + memcpy(__entry->comm, p ? p->comm : "(null)", TASK_COMM_LEN); + __entry->pid = p ? p->pid : -1; + __entry->load = se->avg.load_avg; + __entry->util = se->avg.util_avg; + __entry->util_pelt = __entry->util; + __entry->util_walt = 0; +#ifdef CONFIG_SCHED_WALT + if (!se->my_q) { + struct task_struct *p = container_of(se, struct task_struct, se); + __entry->util_walt = p->ravg.demand; + do_div(__entry->util_walt, walt_ravg_window >> SCHED_CAPACITY_SHIFT); + if (!walt_disabled && sysctl_sched_use_walt_task_util) + __entry->util = __entry->util_walt; + } +#endif + ), + + TP_printk("cpu=%d path=%s comm=%s pid=%d load=%lu util=%lu util_pelt=%lu util_walt=%lu", + __entry->cpu, __get_str(path), __entry->comm, + __entry->pid, __entry->load, __entry->util, + __entry->util_pelt, __entry->util_walt) +); + +/* + * Tracepoint for task_group load tracking: + */ +#ifdef CONFIG_FAIR_GROUP_SCHED +TRACE_EVENT(sched_load_tg, + + TP_PROTO(struct cfs_rq *cfs_rq), + + TP_ARGS(cfs_rq), + + TP_STRUCT__entry( + __field( int, cpu ) + __dynamic_array(char, path, + __trace_sched_path(cfs_rq, NULL, 0) ) + __field( long, load ) + ), + + TP_fast_assign( + __entry->cpu = cfs_rq->rq->cpu; + __trace_sched_path(cfs_rq, __get_dynamic_array(path), + __get_dynamic_array_len(path)); + __entry->load = atomic_long_read(&cfs_rq->tg->load_avg); + ), + + TP_printk("cpu=%d path=%s load=%ld", __entry->cpu, __get_str(path), + __entry->load) +); +#endif /* CONFIG_FAIR_GROUP_SCHED */ + +/* + * Tracepoint for accounting CPU boosted utilization + */ +TRACE_EVENT(sched_boost_cpu, + + TP_PROTO(int cpu, unsigned long util, long margin), + + TP_ARGS(cpu, util, margin), + + TP_STRUCT__entry( + __field( int, cpu ) + __field( unsigned long, util ) + __field(long, margin ) + ), + + TP_fast_assign( + __entry->cpu = cpu; + __entry->util = util; + __entry->margin = margin; + ), + + TP_printk("cpu=%d util=%lu margin=%ld", + __entry->cpu, + __entry->util, + __entry->margin) +); + +/* + * Tracepoint for schedtune_tasks_update + */ +TRACE_EVENT(sched_tune_tasks_update, + + TP_PROTO(struct task_struct *tsk, int cpu, int tasks, int idx, + int boost, int max_boost), + + TP_ARGS(tsk, cpu, tasks, idx, boost, max_boost), + + TP_STRUCT__entry( + __array( char, comm, TASK_COMM_LEN ) + __field( pid_t, pid ) + __field( int, cpu ) + __field( int, tasks ) + __field( int, idx ) + __field( int, boost ) + __field( int, max_boost ) + ), + + TP_fast_assign( + memcpy(__entry->comm, tsk->comm, TASK_COMM_LEN); + __entry->pid = tsk->pid; + __entry->cpu = cpu; + __entry->tasks = tasks; + __entry->idx = idx; + __entry->boost = boost; + __entry->max_boost = max_boost; + ), + + TP_printk("pid=%d comm=%s " + "cpu=%d tasks=%d idx=%d boost=%d max_boost=%d", + __entry->pid, __entry->comm, + __entry->cpu, __entry->tasks, __entry->idx, + __entry->boost, __entry->max_boost) +); + +/* + * Tracepoint for schedtune_boostgroup_update + */ +TRACE_EVENT(sched_tune_boostgroup_update, + + TP_PROTO(int cpu, int variation, int max_boost), + + TP_ARGS(cpu, variation, max_boost), + + TP_STRUCT__entry( + __field( int, cpu ) + __field( int, variation ) + __field( int, max_boost ) + ), + + TP_fast_assign( + __entry->cpu = cpu; + __entry->variation = variation; + __entry->max_boost = max_boost; + ), + + TP_printk("cpu=%d variation=%d max_boost=%d", + __entry->cpu, __entry->variation, __entry->max_boost) +); + +/* + * Tracepoint for accounting task boosted utilization + */ +TRACE_EVENT(sched_boost_task, + + TP_PROTO(struct task_struct *tsk, unsigned long util, long margin), + + TP_ARGS(tsk, util, margin), + + TP_STRUCT__entry( + __array( char, comm, TASK_COMM_LEN ) + __field( pid_t, pid ) + __field( unsigned long, util ) + __field( long, margin ) + + ), + + TP_fast_assign( + memcpy(__entry->comm, tsk->comm, TASK_COMM_LEN); + __entry->pid = tsk->pid; + __entry->util = util; + __entry->margin = margin; + ), + + TP_printk("comm=%s pid=%d util=%lu margin=%ld", + __entry->comm, __entry->pid, + __entry->util, + __entry->margin) +); + +/* + * Tracepoint for system overutilized flag + */ +struct sched_domain; +TRACE_EVENT_CONDITION(sched_overutilized, + + TP_PROTO(struct sched_domain *sd, bool was_overutilized, bool overutilized), + + TP_ARGS(sd, was_overutilized, overutilized), + + TP_CONDITION(overutilized != was_overutilized), + + TP_STRUCT__entry( + __field( bool, overutilized ) + __array( char, cpulist , 32 ) + ), + + TP_fast_assign( + __entry->overutilized = overutilized; + scnprintf(__entry->cpulist, sizeof(__entry->cpulist), "%*pbl", cpumask_pr_args(sched_domain_span(sd))); + ), + + TP_printk("overutilized=%d sd_span=%s", + __entry->overutilized ? 1 : 0, __entry->cpulist) +); + +/* + * Tracepoint for find_best_target + */ +TRACE_EVENT(sched_find_best_target, + + TP_PROTO(struct task_struct *tsk, bool prefer_idle, + unsigned long min_util, int start_cpu, + int best_idle, int best_active, int target), + + TP_ARGS(tsk, prefer_idle, min_util, start_cpu, + best_idle, best_active, target), + + TP_STRUCT__entry( + __array( char, comm, TASK_COMM_LEN ) + __field( pid_t, pid ) + __field( unsigned long, min_util ) + __field( bool, prefer_idle ) + __field( int, start_cpu ) + __field( int, best_idle ) + __field( int, best_active ) + __field( int, target ) + ), + + TP_fast_assign( + memcpy(__entry->comm, tsk->comm, TASK_COMM_LEN); + __entry->pid = tsk->pid; + __entry->min_util = min_util; + __entry->prefer_idle = prefer_idle; + __entry->start_cpu = start_cpu; + __entry->best_idle = best_idle; + __entry->best_active = best_active; + __entry->target = target; + ), + + TP_printk("pid=%d comm=%s prefer_idle=%d start_cpu=%d " + "best_idle=%d best_active=%d target=%d", + __entry->pid, __entry->comm, + __entry->prefer_idle, __entry->start_cpu, + __entry->best_idle, __entry->best_active, + __entry->target) +); + +#ifdef CONFIG_SCHED_WALT +struct rq; + +TRACE_EVENT(walt_update_task_ravg, + + TP_PROTO(struct task_struct *p, struct rq *rq, int evt, + u64 wallclock, u64 irqtime), + + TP_ARGS(p, rq, evt, wallclock, irqtime), + + TP_STRUCT__entry( + __array( char, comm, TASK_COMM_LEN ) + __field( pid_t, pid ) + __field( pid_t, cur_pid ) + __field( u64, wallclock ) + __field( u64, mark_start ) + __field( u64, delta_m ) + __field( u64, win_start ) + __field( u64, delta ) + __field( u64, irqtime ) + __array( char, evt, 16 ) + __field(unsigned int, demand ) + __field(unsigned int, sum ) + __field( int, cpu ) + __field( u64, cs ) + __field( u64, ps ) + __field( u32, curr_window ) + __field( u32, prev_window ) + __field( u64, nt_cs ) + __field( u64, nt_ps ) + __field( u32, active_windows ) + ), + + TP_fast_assign( + static const char* walt_event_names[] = + { + "PUT_PREV_TASK", + "PICK_NEXT_TASK", + "TASK_WAKE", + "TASK_MIGRATE", + "TASK_UPDATE", + "IRQ_UPDATE" + }; + __entry->wallclock = wallclock; + __entry->win_start = rq->window_start; + __entry->delta = (wallclock - rq->window_start); + strcpy(__entry->evt, walt_event_names[evt]); + __entry->cpu = rq->cpu; + __entry->cur_pid = rq->curr->pid; + memcpy(__entry->comm, p->comm, TASK_COMM_LEN); + __entry->pid = p->pid; + __entry->mark_start = p->ravg.mark_start; + __entry->delta_m = (wallclock - p->ravg.mark_start); + __entry->demand = p->ravg.demand; + __entry->sum = p->ravg.sum; + __entry->irqtime = irqtime; + __entry->cs = rq->curr_runnable_sum; + __entry->ps = rq->prev_runnable_sum; + __entry->curr_window = p->ravg.curr_window; + __entry->prev_window = p->ravg.prev_window; + __entry->nt_cs = rq->nt_curr_runnable_sum; + __entry->nt_ps = rq->nt_prev_runnable_sum; + __entry->active_windows = p->ravg.active_windows; + ), + + TP_printk("wallclock=%llu window_start=%llu delta=%llu event=%s cpu=%d cur_pid=%d pid=%d comm=%s" + " mark_start=%llu delta=%llu demand=%u sum=%u irqtime=%llu" + " curr_runnable_sum=%llu prev_runnable_sum=%llu cur_window=%u" + " prev_window=%u nt_curr_runnable_sum=%llu nt_prev_runnable_sum=%llu active_windows=%u", + __entry->wallclock, __entry->win_start, __entry->delta, + __entry->evt, __entry->cpu, __entry->cur_pid, + __entry->pid, __entry->comm, __entry->mark_start, + __entry->delta_m, __entry->demand, + __entry->sum, __entry->irqtime, + __entry->cs, __entry->ps, + __entry->curr_window, __entry->prev_window, + __entry->nt_cs, __entry->nt_ps, + __entry->active_windows + ) +); + +TRACE_EVENT(walt_update_history, + + TP_PROTO(struct rq *rq, struct task_struct *p, u32 runtime, int samples, + int evt), + + TP_ARGS(rq, p, runtime, samples, evt), + + TP_STRUCT__entry( + __array( char, comm, TASK_COMM_LEN ) + __field( pid_t, pid ) + __field(unsigned int, runtime ) + __field( int, samples ) + __field( int, evt ) + __field( u64, demand ) + __field(unsigned int, walt_avg ) + __field(unsigned int, pelt_avg ) + __array( u32, hist, RAVG_HIST_SIZE_MAX) + __field( int, cpu ) + ), + + TP_fast_assign( + memcpy(__entry->comm, p->comm, TASK_COMM_LEN); + __entry->pid = p->pid; + __entry->runtime = runtime; + __entry->samples = samples; + __entry->evt = evt; + __entry->demand = p->ravg.demand; + __entry->walt_avg = (__entry->demand << 10) / walt_ravg_window, + __entry->pelt_avg = p->se.avg.util_avg; + memcpy(__entry->hist, p->ravg.sum_history, + RAVG_HIST_SIZE_MAX * sizeof(u32)); + __entry->cpu = rq->cpu; + ), + + TP_printk("pid=%d comm=%s runtime=%u samples=%d event=%d demand=%llu ravg_window=%u" + " walt=%u pelt=%u hist0=%u hist1=%u hist2=%u hist3=%u hist4=%u cpu=%d", + __entry->pid, __entry->comm, + __entry->runtime, __entry->samples, __entry->evt, + __entry->demand, + walt_ravg_window, + __entry->walt_avg, + __entry->pelt_avg, + __entry->hist[0], __entry->hist[1], + __entry->hist[2], __entry->hist[3], + __entry->hist[4], __entry->cpu) +); + +TRACE_EVENT(walt_migration_update_sum, + + TP_PROTO(struct rq *rq, struct task_struct *p), + + TP_ARGS(rq, p), + + TP_STRUCT__entry( + __field(int, cpu ) + __field(int, pid ) + __field( u64, cs ) + __field( u64, ps ) + __field( s64, nt_cs ) + __field( s64, nt_ps ) + ), + + TP_fast_assign( + __entry->cpu = cpu_of(rq); + __entry->cs = rq->curr_runnable_sum; + __entry->ps = rq->prev_runnable_sum; + __entry->nt_cs = (s64)rq->nt_curr_runnable_sum; + __entry->nt_ps = (s64)rq->nt_prev_runnable_sum; + __entry->pid = p->pid; + ), + + TP_printk("cpu=%d curr_runnable_sum=%llu prev_runnable_sum=%llu nt_curr_runnable_sum=%lld nt_prev_runnable_sum=%lld pid=%d", + __entry->cpu, __entry->cs, __entry->ps, + __entry->nt_cs, __entry->nt_ps, __entry->pid) +); +#endif /* CONFIG_SCHED_WALT */ +#endif /* CONFIG_SMP */ #endif /* _TRACE_SCHED_H */ /* This part must be outside protection */
diff --git a/init/Kconfig b/init/Kconfig index 3c1faaa..09eb610 100644 --- a/init/Kconfig +++ b/init/Kconfig
@@ -400,6 +400,15 @@ If in doubt, say N here. +config SCHED_WALT + bool "Support window based load tracking" + depends on SMP + help + This feature will allow the scheduler to maintain a tunable window + based set of metrics for tasks and runqueues. These metrics can be + used to guide task placement as well as task frequency requirements + for cpufreq governors. + config BSD_PROCESS_ACCT bool "BSD Process Accounting" depends on MULTIUSER @@ -959,6 +968,39 @@ desktop applications. Task group autogeneration is currently based upon task session. +config SCHED_TUNE + bool "Boosting for CFS tasks (EXPERIMENTAL)" + depends on SMP + help + This option enables support for task classification using a new + cgroup controller, schedtune. Schedtune allows tasks to be given + a boost value and marked as latency-sensitive or not. This option + provides the "schedtune" controller. + + This new controller: + 1. allows only a two layers hierarchy, where the root defines the + system-wide boost value and its direct childrens define each one a + different "class of tasks" to be boosted with a different value + 2. supports up to 16 different task classes, each one which could be + configured with a different boost value + + Latency-sensitive tasks are not subject to energy-aware wakeup + task placement. The boost value assigned to tasks is used to + influence task placement and CPU frequency selection (if + utilization-driven frequency selection is in use). + + If unsure, say N. + +config DEFAULT_USE_ENERGY_AWARE + bool "Default to enabling the Energy Aware Scheduler feature" + default n + help + This option defaults the ENERGY_AWARE scheduling feature to true, + as without SCHED_DEBUG set this feature can't be enabled or disabled + via sysctl. + + Say N if unsure. + config SYSFS_DEPRECATED bool "Enable deprecated sysfs features to support old userspace tools" depends on SYSFS
diff --git a/kernel/sched/Makefile b/kernel/sched/Makefile index a9ee16b..b9207a9 100644 --- a/kernel/sched/Makefile +++ b/kernel/sched/Makefile
@@ -20,9 +20,12 @@ obj-y += idle_task.o fair.o rt.o deadline.o obj-y += wait.o wait_bit.o swait.o completion.o idle.o obj-$(CONFIG_SMP) += cpupri.o cpudeadline.o topology.o stop_task.o +obj-$(CONFIG_GENERIC_ARCH_TOPOLOGY) += energy.o +obj-$(CONFIG_SCHED_WALT) += walt.o obj-$(CONFIG_SCHED_AUTOGROUP) += autogroup.o obj-$(CONFIG_SCHEDSTATS) += stats.o obj-$(CONFIG_SCHED_DEBUG) += debug.o +obj-$(CONFIG_SCHED_TUNE) += tune.o obj-$(CONFIG_CGROUP_CPUACCT) += cpuacct.o obj-$(CONFIG_CPU_FREQ) += cpufreq.o obj-$(CONFIG_CPU_FREQ_GOV_SCHEDUTIL) += cpufreq_schedutil.o
diff --git a/kernel/sched/autogroup.c b/kernel/sched/autogroup.c index a43df51..c80a47a 100644 --- a/kernel/sched/autogroup.c +++ b/kernel/sched/autogroup.c
@@ -260,7 +260,6 @@ void proc_sched_autogroup_show_task(struct task_struct *p, struct seq_file *m) } #endif /* CONFIG_PROC_FS */ -#ifdef CONFIG_SCHED_DEBUG int autogroup_path(struct task_group *tg, char *buf, int buflen) { if (!task_group_is_autogroup(tg)) @@ -268,4 +267,3 @@ int autogroup_path(struct task_group *tg, char *buf, int buflen) return snprintf(buf, buflen, "%s-%ld", "/autogroup", tg->autogroup->id); } -#endif /* CONFIG_SCHED_DEBUG */
diff --git a/kernel/sched/autogroup.h b/kernel/sched/autogroup.h index 27cd22b..590184b 100644 --- a/kernel/sched/autogroup.h +++ b/kernel/sched/autogroup.h
@@ -56,11 +56,9 @@ autogroup_task_group(struct task_struct *p, struct task_group *tg) return tg; } -#ifdef CONFIG_SCHED_DEBUG static inline int autogroup_path(struct task_group *tg, char *buf, int buflen) { return 0; } -#endif #endif /* CONFIG_SCHED_AUTOGROUP */
diff --git a/kernel/sched/core.c b/kernel/sched/core.c index d17c5da..00c6f578 100644 --- a/kernel/sched/core.c +++ b/kernel/sched/core.c
@@ -39,6 +39,7 @@ #define CREATE_TRACE_POINTS #include <trace/events/sched.h> +#include "walt.h" DEFINE_PER_CPU_SHARED_ALIGNED(struct rq, runqueues); @@ -438,6 +439,8 @@ void wake_q_add(struct wake_q_head *head, struct task_struct *task) if (cmpxchg(&node->next, NULL, WAKE_Q_TAIL)) return; + head->count++; + get_task_struct(task); /* @@ -447,6 +450,10 @@ void wake_q_add(struct wake_q_head *head, struct task_struct *task) head->lastp = &node->next; } +static int +try_to_wake_up(struct task_struct *p, unsigned int state, int wake_flags, + int sibling_count_hint); + void wake_up_q(struct wake_q_head *head) { struct wake_q_node *node = head->first; @@ -461,10 +468,10 @@ void wake_up_q(struct wake_q_head *head) task->wake_q.next = NULL; /* - * wake_up_process() implies a wmb() to pair with the queueing + * try_to_wake_up() implies a wmb() to pair with the queueing * in wake_q_add() so as not to miss wakeups. */ - wake_up_process(task); + try_to_wake_up(task, TASK_NORMAL, 0, head->count); put_task_struct(task); } } @@ -1186,6 +1193,8 @@ void set_task_cpu(struct task_struct *p, unsigned int new_cpu) p->sched_class->migrate_task_rq(p); p->se.nr_migrations++; perf_event_task_migrate(p); + + walt_fixup_busy_time(p, new_cpu); } __set_task_cpu(p, new_cpu); @@ -1536,12 +1545,14 @@ static int select_fallback_rq(int cpu, struct task_struct *p) * The caller (fork, wakeup) owns p->pi_lock, ->cpus_allowed is stable. */ static inline -int select_task_rq(struct task_struct *p, int cpu, int sd_flags, int wake_flags) +int select_task_rq(struct task_struct *p, int cpu, int sd_flags, int wake_flags, + int sibling_count_hint) { lockdep_assert_held(&p->pi_lock); if (p->nr_cpus_allowed > 1) - cpu = p->sched_class->select_task_rq(p, cpu, sd_flags, wake_flags); + cpu = p->sched_class->select_task_rq(p, cpu, sd_flags, wake_flags, + sibling_count_hint); else cpu = cpumask_any(&p->cpus_allowed); @@ -1948,11 +1959,33 @@ static void ttwu_queue(struct task_struct *p, int cpu, int wake_flags) * */ +#ifdef CONFIG_SMP +#ifdef CONFIG_SCHED_WALT +/* utility function to update walt signals at wakeup */ +static inline void walt_try_to_wake_up(struct task_struct *p) +{ + struct rq *rq = cpu_rq(task_cpu(p)); + struct rq_flags rf; + u64 wallclock; + + rq_lock_irqsave(rq, &rf); + wallclock = walt_ktime_clock(); + walt_update_task_ravg(rq->curr, rq, TASK_UPDATE, wallclock, 0); + walt_update_task_ravg(p, rq, TASK_WAKE, wallclock, 0); + rq_unlock_irqrestore(rq, &rf); +} +#else +#define walt_try_to_wake_up(a) {} +#endif +#endif + /** * try_to_wake_up - wake up a thread * @p: the thread to be awakened * @state: the mask of task states that can be woken * @wake_flags: wake modifier flags (WF_*) + * @sibling_count_hint: A hint at the number of threads that are being woken up + * in this event. * * If (@state & @p->state) @p->state = TASK_RUNNING. * @@ -1965,7 +1998,8 @@ static void ttwu_queue(struct task_struct *p, int cpu, int wake_flags) * %false otherwise. */ static int -try_to_wake_up(struct task_struct *p, unsigned int state, int wake_flags) +try_to_wake_up(struct task_struct *p, unsigned int state, int wake_flags, + int sibling_count_hint) { unsigned long flags; int cpu, success = 0; @@ -2043,6 +2077,8 @@ try_to_wake_up(struct task_struct *p, unsigned int state, int wake_flags) */ smp_cond_load_acquire(&p->on_cpu, !VAL); + walt_try_to_wake_up(p); + p->sched_contributes_to_load = !!task_contributes_to_load(p); p->state = TASK_WAKING; @@ -2051,7 +2087,8 @@ try_to_wake_up(struct task_struct *p, unsigned int state, int wake_flags) atomic_dec(&task_rq(p)->nr_iowait); } - cpu = select_task_rq(p, p->wake_cpu, SD_BALANCE_WAKE, wake_flags); + cpu = select_task_rq(p, p->wake_cpu, SD_BALANCE_WAKE, wake_flags, + sibling_count_hint); if (task_cpu(p) != cpu) { wake_flags |= WF_MIGRATED; set_task_cpu(p, cpu); @@ -2112,6 +2149,11 @@ static void try_to_wake_up_local(struct task_struct *p, struct rq_flags *rf) trace_sched_waking(p); if (!task_on_rq_queued(p)) { + u64 wallclock = walt_ktime_clock(); + + walt_update_task_ravg(rq->curr, rq, TASK_UPDATE, wallclock, 0); + walt_update_task_ravg(p, rq, TASK_WAKE, wallclock, 0); + if (p->in_iowait) { delayacct_blkio_end(); atomic_dec(&rq->nr_iowait); @@ -2139,13 +2181,13 @@ static void try_to_wake_up_local(struct task_struct *p, struct rq_flags *rf) */ int wake_up_process(struct task_struct *p) { - return try_to_wake_up(p, TASK_NORMAL, 0); + return try_to_wake_up(p, TASK_NORMAL, 0, 1); } EXPORT_SYMBOL(wake_up_process); int wake_up_state(struct task_struct *p, unsigned int state) { - return try_to_wake_up(p, state, 0); + return try_to_wake_up(p, state, 0, 1); } /* @@ -2164,7 +2206,12 @@ static void __sched_fork(unsigned long clone_flags, struct task_struct *p) p->se.prev_sum_exec_runtime = 0; p->se.nr_migrations = 0; p->se.vruntime = 0; +#ifdef CONFIG_SCHED_WALT + p->last_sleep_ts = 0; +#endif + INIT_LIST_HEAD(&p->se.group_node); + walt_init_new_task_load(p); #ifdef CONFIG_FAIR_GROUP_SCHED p->se.cfs_rq = NULL; @@ -2441,6 +2488,9 @@ void wake_up_new_task(struct task_struct *p) struct rq *rq; raw_spin_lock_irqsave(&p->pi_lock, rf.flags); + + walt_init_new_task_load(p); + p->state = TASK_RUNNING; #ifdef CONFIG_SMP /* @@ -2451,13 +2501,15 @@ void wake_up_new_task(struct task_struct *p) * Use __set_task_cpu() to avoid calling sched_class::migrate_task_rq, * as we're not fully set-up yet. */ - __set_task_cpu(p, select_task_rq(p, task_cpu(p), SD_BALANCE_FORK, 0)); + __set_task_cpu(p, select_task_rq(p, task_cpu(p), SD_BALANCE_FORK, 0, 1)); #endif rq = __task_rq_lock(p, &rf); update_rq_clock(rq); post_init_entity_util_avg(&p->se); activate_task(rq, p, ENQUEUE_NOCLOCK); + walt_mark_task_starting(p); + p->on_rq = TASK_ON_RQ_QUEUED; trace_sched_wakeup_new(p); check_preempt_curr(rq, p, WF_FORK); @@ -2912,7 +2964,7 @@ void sched_exec(void) int dest_cpu; raw_spin_lock_irqsave(&p->pi_lock, flags); - dest_cpu = p->sched_class->select_task_rq(p, task_cpu(p), SD_BALANCE_EXEC, 0); + dest_cpu = p->sched_class->select_task_rq(p, task_cpu(p), SD_BALANCE_EXEC, 0, 1); if (dest_cpu == smp_processor_id()) goto unlock; @@ -3011,6 +3063,9 @@ void scheduler_tick(void) rq_lock(rq, &rf); + walt_set_window_start(rq, &rf); + walt_update_task_ravg(rq->curr, rq, TASK_UPDATE, + walt_ktime_clock(), 0); update_rq_clock(rq); curr->sched_class->task_tick(rq, curr, 0); cpu_load_update_active(rq); @@ -3282,6 +3337,7 @@ static void __sched notrace __schedule(bool preempt) struct rq_flags rf; struct rq *rq; int cpu; + u64 wallclock; cpu = smp_processor_id(); rq = cpu_rq(cpu); @@ -3337,10 +3393,17 @@ static void __sched notrace __schedule(bool preempt) } next = pick_next_task(rq, prev, &rf); + wallclock = walt_ktime_clock(); + walt_update_task_ravg(prev, rq, PUT_PREV_TASK, wallclock, 0); + walt_update_task_ravg(next, rq, PICK_NEXT_TASK, wallclock, 0); clear_tsk_need_resched(prev); clear_preempt_need_resched(); if (likely(prev != next)) { +#ifdef CONFIG_SCHED_WALT + if (!prev->on_rq) + prev->last_sleep_ts = wallclock; +#endif rq->nr_switches++; rq->curr = next; /* @@ -3616,7 +3679,7 @@ asmlinkage __visible void __sched preempt_schedule_irq(void) int default_wake_function(wait_queue_entry_t *curr, unsigned mode, int wake_flags, void *key) { - return try_to_wake_up(curr->private, mode, wake_flags); + return try_to_wake_up(curr->private, mode, wake_flags, 1); } EXPORT_SYMBOL(default_wake_function); @@ -5692,6 +5755,9 @@ int sched_cpu_dying(unsigned int cpu) sched_ttwu_pending(); rq_lock_irqsave(rq, &rf); + + walt_migrate_sync_cpu(cpu); + if (rq->rd) { BUG_ON(!cpumask_test_cpu(cpu, rq->rd->span)); set_rq_offline(rq); @@ -5917,12 +5983,18 @@ void __init sched_init(void) rq->idle_stamp = 0; rq->avg_idle = 2*sysctl_sched_migration_cost; rq->max_idle_balance_cost = sysctl_sched_migration_cost; +#ifdef CONFIG_SCHED_WALT + rq->cur_irqload = 0; + rq->avg_irqload = 0; + rq->irqload_ts = 0; +#endif INIT_LIST_HEAD(&rq->cfs_tasks); rq_attach_root(rq, &def_root_domain); #ifdef CONFIG_NO_HZ_COMMON rq->last_load_update_tick = jiffies; + rq->last_blocked_load_update_tick = jiffies; rq->nohz_flags = 0; #endif #ifdef CONFIG_NO_HZ_FULL
diff --git a/kernel/sched/cpufreq_schedutil.c b/kernel/sched/cpufreq_schedutil.c index ba0da24..9d946e6 100644 --- a/kernel/sched/cpufreq_schedutil.c +++ b/kernel/sched/cpufreq_schedutil.c
@@ -19,11 +19,14 @@ #include "sched.h" +unsigned long boosted_cpu_util(int cpu); + #define SUGOV_KTHREAD_PRIORITY 50 struct sugov_tunables { struct gov_attr_set attr_set; - unsigned int rate_limit_us; + unsigned int up_rate_limit_us; + unsigned int down_rate_limit_us; }; struct sugov_policy { @@ -34,7 +37,9 @@ struct sugov_policy { raw_spinlock_t update_lock; /* For shared policies */ u64 last_freq_update_time; - s64 freq_update_delay_ns; + s64 min_rate_limit_ns; + s64 up_rate_delay_ns; + s64 down_rate_delay_ns; unsigned int next_freq; unsigned int cached_raw_freq; @@ -111,8 +116,32 @@ static bool sugov_should_update_freq(struct sugov_policy *sg_policy, u64 time) return true; } + /* No need to recalculate next freq for min_rate_limit_us + * at least. However we might still decide to further rate + * limit once frequency change direction is decided, according + * to the separate rate limits. + */ + delta_ns = time - sg_policy->last_freq_update_time; - return delta_ns >= sg_policy->freq_update_delay_ns; + return delta_ns >= sg_policy->min_rate_limit_ns; +} + +static bool sugov_up_down_rate_limit(struct sugov_policy *sg_policy, u64 time, + unsigned int next_freq) +{ + s64 delta_ns; + + delta_ns = time - sg_policy->last_freq_update_time; + + if (next_freq > sg_policy->next_freq && + delta_ns < sg_policy->up_rate_delay_ns) + return true; + + if (next_freq < sg_policy->next_freq && + delta_ns < sg_policy->down_rate_delay_ns) + return true; + + return false; } static void sugov_update_commit(struct sugov_policy *sg_policy, u64 time, @@ -123,6 +152,9 @@ static void sugov_update_commit(struct sugov_policy *sg_policy, u64 time, if (sg_policy->next_freq == next_freq) return; + if (sugov_up_down_rate_limit(sg_policy, time, next_freq)) + return; + sg_policy->next_freq = next_freq; sg_policy->last_freq_update_time = time; @@ -178,13 +210,15 @@ static unsigned int get_next_freq(struct sugov_policy *sg_policy, static void sugov_get_util(unsigned long *util, unsigned long *max, int cpu) { - struct rq *rq = cpu_rq(cpu); - unsigned long cfs_max; + unsigned long max_cap, rt; - cfs_max = arch_scale_cpu_capacity(NULL, cpu); + max_cap = arch_scale_cpu_capacity(NULL, cpu); - *util = min(rq->cfs.avg.util_avg, cfs_max); - *max = cfs_max; + rt = sched_get_rt_rq_util(cpu); + + *util = boosted_cpu_util(cpu) + rt; + *util = min(*util, max_cap); + *max = max_cap; } static void sugov_set_iowait_boost(struct sugov_cpu *sg_cpu, u64 time, @@ -272,7 +306,7 @@ static void sugov_update_single(struct update_util_data *hook, u64 time, busy = sugov_cpu_is_busy(sg_cpu); - if (flags & SCHED_CPUFREQ_RT_DL) { + if (flags & SCHED_CPUFREQ_DL) { next_f = policy->cpuinfo.max_freq; } else { sugov_get_util(&util, &max, sg_cpu->cpu); @@ -285,6 +319,7 @@ static void sugov_update_single(struct update_util_data *hook, u64 time, if (busy && next_f < sg_policy->next_freq) next_f = sg_policy->next_freq; } + sugov_update_commit(sg_policy, time, next_f); } @@ -313,7 +348,7 @@ static unsigned int sugov_next_freq_shared(struct sugov_cpu *sg_cpu, u64 time) j_sg_cpu->iowait_boost_pending = false; continue; } - if (j_sg_cpu->flags & SCHED_CPUFREQ_RT_DL) + if (j_sg_cpu->flags & SCHED_CPUFREQ_DL) return policy->cpuinfo.max_freq; j_util = j_sg_cpu->util; @@ -349,7 +384,7 @@ static void sugov_update_shared(struct update_util_data *hook, u64 time, sg_cpu->last_update = time; if (sugov_should_update_freq(sg_policy, time)) { - if (flags & SCHED_CPUFREQ_RT_DL) + if (flags & SCHED_CPUFREQ_DL) next_f = sg_policy->policy->cpuinfo.max_freq; else next_f = sugov_next_freq_shared(sg_cpu, time); @@ -404,15 +439,32 @@ static inline struct sugov_tunables *to_sugov_tunables(struct gov_attr_set *attr return container_of(attr_set, struct sugov_tunables, attr_set); } -static ssize_t rate_limit_us_show(struct gov_attr_set *attr_set, char *buf) +static DEFINE_MUTEX(min_rate_lock); + +static void update_min_rate_limit_ns(struct sugov_policy *sg_policy) +{ + mutex_lock(&min_rate_lock); + sg_policy->min_rate_limit_ns = min(sg_policy->up_rate_delay_ns, + sg_policy->down_rate_delay_ns); + mutex_unlock(&min_rate_lock); +} + +static ssize_t up_rate_limit_us_show(struct gov_attr_set *attr_set, char *buf) { struct sugov_tunables *tunables = to_sugov_tunables(attr_set); - return sprintf(buf, "%u\n", tunables->rate_limit_us); + return sprintf(buf, "%u\n", tunables->up_rate_limit_us); } -static ssize_t rate_limit_us_store(struct gov_attr_set *attr_set, const char *buf, - size_t count) +static ssize_t down_rate_limit_us_show(struct gov_attr_set *attr_set, char *buf) +{ + struct sugov_tunables *tunables = to_sugov_tunables(attr_set); + + return sprintf(buf, "%u\n", tunables->down_rate_limit_us); +} + +static ssize_t up_rate_limit_us_store(struct gov_attr_set *attr_set, + const char *buf, size_t count) { struct sugov_tunables *tunables = to_sugov_tunables(attr_set); struct sugov_policy *sg_policy; @@ -421,18 +473,42 @@ static ssize_t rate_limit_us_store(struct gov_attr_set *attr_set, const char *bu if (kstrtouint(buf, 10, &rate_limit_us)) return -EINVAL; - tunables->rate_limit_us = rate_limit_us; + tunables->up_rate_limit_us = rate_limit_us; - list_for_each_entry(sg_policy, &attr_set->policy_list, tunables_hook) - sg_policy->freq_update_delay_ns = rate_limit_us * NSEC_PER_USEC; + list_for_each_entry(sg_policy, &attr_set->policy_list, tunables_hook) { + sg_policy->up_rate_delay_ns = rate_limit_us * NSEC_PER_USEC; + update_min_rate_limit_ns(sg_policy); + } return count; } -static struct governor_attr rate_limit_us = __ATTR_RW(rate_limit_us); +static ssize_t down_rate_limit_us_store(struct gov_attr_set *attr_set, + const char *buf, size_t count) +{ + struct sugov_tunables *tunables = to_sugov_tunables(attr_set); + struct sugov_policy *sg_policy; + unsigned int rate_limit_us; + + if (kstrtouint(buf, 10, &rate_limit_us)) + return -EINVAL; + + tunables->down_rate_limit_us = rate_limit_us; + + list_for_each_entry(sg_policy, &attr_set->policy_list, tunables_hook) { + sg_policy->down_rate_delay_ns = rate_limit_us * NSEC_PER_USEC; + update_min_rate_limit_ns(sg_policy); + } + + return count; +} + +static struct governor_attr up_rate_limit_us = __ATTR_RW(up_rate_limit_us); +static struct governor_attr down_rate_limit_us = __ATTR_RW(down_rate_limit_us); static struct attribute *sugov_attributes[] = { - &rate_limit_us.attr, + &up_rate_limit_us.attr, + &down_rate_limit_us.attr, NULL }; @@ -579,7 +655,8 @@ static int sugov_init(struct cpufreq_policy *policy) goto stop_kthread; } - tunables->rate_limit_us = cpufreq_policy_transition_delay_us(policy); + tunables->up_rate_limit_us = cpufreq_policy_transition_delay_us(policy); + tunables->down_rate_limit_us = cpufreq_policy_transition_delay_us(policy); policy->governor_data = sg_policy; sg_policy->tunables = tunables; @@ -638,7 +715,11 @@ static int sugov_start(struct cpufreq_policy *policy) struct sugov_policy *sg_policy = policy->governor_data; unsigned int cpu; - sg_policy->freq_update_delay_ns = sg_policy->tunables->rate_limit_us * NSEC_PER_USEC; + sg_policy->up_rate_delay_ns = + sg_policy->tunables->up_rate_limit_us * NSEC_PER_USEC; + sg_policy->down_rate_delay_ns = + sg_policy->tunables->down_rate_limit_us * NSEC_PER_USEC; + update_min_rate_limit_ns(sg_policy); sg_policy->last_freq_update_time = 0; sg_policy->next_freq = UINT_MAX; sg_policy->work_in_progress = false; @@ -651,7 +732,7 @@ static int sugov_start(struct cpufreq_policy *policy) memset(sg_cpu, 0, sizeof(*sg_cpu)); sg_cpu->cpu = cpu; sg_cpu->sg_policy = sg_policy; - sg_cpu->flags = SCHED_CPUFREQ_RT; + sg_cpu->flags = SCHED_CPUFREQ_DL; sg_cpu->iowait_boost_max = policy->cpuinfo.max_freq; }
diff --git a/kernel/sched/cputime.c b/kernel/sched/cputime.c index 14d2dbf..029b505 100644 --- a/kernel/sched/cputime.c +++ b/kernel/sched/cputime.c
@@ -6,6 +6,7 @@ #include <linux/context_tracking.h> #include <linux/sched/cputime.h> #include "sched.h" +#include "walt.h" #ifdef CONFIG_IRQ_TIME_ACCOUNTING @@ -55,11 +56,18 @@ void irqtime_account_irq(struct task_struct *curr) struct irqtime *irqtime = this_cpu_ptr(&cpu_irqtime); s64 delta; int cpu; +#ifdef CONFIG_SCHED_WALT + u64 wallclock; + bool account = true; +#endif if (!sched_clock_irqtime) return; cpu = smp_processor_id(); +#ifdef CONFIG_SCHED_WALT + wallclock = sched_clock_cpu(cpu); +#endif delta = sched_clock_cpu(cpu) - irqtime->irq_start_time; irqtime->irq_start_time += delta; @@ -73,6 +81,13 @@ void irqtime_account_irq(struct task_struct *curr) irqtime_account_delta(irqtime, delta, CPUTIME_IRQ); else if (in_serving_softirq() && curr != this_cpu_ksoftirqd()) irqtime_account_delta(irqtime, delta, CPUTIME_SOFTIRQ); +#ifdef CONFIG_SCHED_WALT + else + account = false; + + if (account) + walt_account_irqtime(cpu, curr, delta, wallclock); +#endif } EXPORT_SYMBOL_GPL(irqtime_account_irq);
diff --git a/kernel/sched/deadline.c b/kernel/sched/deadline.c index 4ae5c1e..f982a3f 100644 --- a/kernel/sched/deadline.c +++ b/kernel/sched/deadline.c
@@ -20,6 +20,8 @@ #include <linux/slab.h> #include <uapi/linux/sched/types.h> +#include "walt.h" + struct dl_bandwidth def_dl_bandwidth; static inline struct task_struct *dl_task_of(struct sched_dl_entity *dl_se) @@ -1290,6 +1292,7 @@ void inc_dl_tasks(struct sched_dl_entity *dl_se, struct dl_rq *dl_rq) WARN_ON(!dl_prio(prio)); dl_rq->dl_nr_running++; add_nr_running(rq_of_dl_rq(dl_rq), 1); + walt_inc_cumulative_runnable_avg(rq_of_dl_rq(dl_rq), dl_task_of(dl_se)); inc_dl_deadline(dl_rq, deadline); inc_dl_migration(dl_se, dl_rq); @@ -1304,6 +1307,7 @@ void dec_dl_tasks(struct sched_dl_entity *dl_se, struct dl_rq *dl_rq) WARN_ON(!dl_rq->dl_nr_running); dl_rq->dl_nr_running--; sub_nr_running(rq_of_dl_rq(dl_rq), 1); + walt_dec_cumulative_runnable_avg(rq_of_dl_rq(dl_rq), dl_task_of(dl_se)); dec_dl_deadline(dl_rq, dl_se->deadline); dec_dl_migration(dl_se, dl_rq); @@ -1505,7 +1509,8 @@ static void yield_task_dl(struct rq *rq) static int find_later_rq(struct task_struct *task); static int -select_task_rq_dl(struct task_struct *p, int cpu, int sd_flag, int flags) +select_task_rq_dl(struct task_struct *p, int cpu, int sd_flag, int flags, + int sibling_count_hint) { struct task_struct *curr; struct rq *rq; @@ -2017,7 +2022,9 @@ static int push_dl_task(struct rq *rq) deactivate_task(rq, next_task, 0); sub_running_bw(next_task->dl.dl_bw, &rq->dl); sub_rq_bw(next_task->dl.dl_bw, &rq->dl); + next_task->on_rq = TASK_ON_RQ_MIGRATING; set_task_cpu(next_task, later_rq->cpu); + next_task->on_rq = TASK_ON_RQ_QUEUED; add_rq_bw(next_task->dl.dl_bw, &later_rq->dl); add_running_bw(next_task->dl.dl_bw, &later_rq->dl); activate_task(later_rq, next_task, 0); @@ -2109,7 +2116,9 @@ static void pull_dl_task(struct rq *this_rq) deactivate_task(src_rq, p, 0); sub_running_bw(p->dl.dl_bw, &src_rq->dl); sub_rq_bw(p->dl.dl_bw, &src_rq->dl); + p->on_rq = TASK_ON_RQ_MIGRATING; set_task_cpu(p, this_cpu); + p->on_rq = TASK_ON_RQ_QUEUED; add_rq_bw(p->dl.dl_bw, &this_rq->dl); add_running_bw(p->dl.dl_bw, &this_rq->dl); activate_task(this_rq, p, 0);
diff --git a/kernel/sched/debug.c b/kernel/sched/debug.c index 2f93e4a..0a93f25 100644 --- a/kernel/sched/debug.c +++ b/kernel/sched/debug.c
@@ -267,9 +267,60 @@ set_table_entry(struct ctl_table *entry, } static struct ctl_table * +sd_alloc_ctl_energy_table(struct sched_group_energy *sge) +{ + struct ctl_table *table = sd_alloc_ctl_entry(5); + + if (table == NULL) + return NULL; + + set_table_entry(&table[0], "nr_idle_states", &sge->nr_idle_states, + sizeof(int), 0644, proc_dointvec_minmax, false); + set_table_entry(&table[1], "idle_states", &sge->idle_states[0].power, + sge->nr_idle_states*sizeof(struct idle_state), 0644, + proc_doulongvec_minmax, false); + set_table_entry(&table[2], "nr_cap_states", &sge->nr_cap_states, + sizeof(int), 0644, proc_dointvec_minmax, false); + set_table_entry(&table[3], "cap_states", &sge->cap_states[0].cap, + sge->nr_cap_states*sizeof(struct capacity_state), 0644, + proc_doulongvec_minmax, false); + + return table; +} + +static struct ctl_table * +sd_alloc_ctl_group_table(struct sched_group *sg) +{ + struct ctl_table *table = sd_alloc_ctl_entry(2); + + if (table == NULL) + return NULL; + + table->procname = kstrdup("energy", GFP_KERNEL); + table->mode = 0555; + table->child = sd_alloc_ctl_energy_table((struct sched_group_energy *)sg->sge); + + return table; +} + +static struct ctl_table * sd_alloc_ctl_domain_table(struct sched_domain *sd) { - struct ctl_table *table = sd_alloc_ctl_entry(14); + struct ctl_table *table; + unsigned int nr_entries = 14; + + int i = 0; + struct sched_group *sg = sd->groups; + + if (sg->sge) { + int nr_sgs = 0; + + do {} while (nr_sgs++, sg = sg->next, sg != sd->groups); + + nr_entries += nr_sgs; + } + + table = sd_alloc_ctl_entry(nr_entries); if (table == NULL) return NULL; @@ -302,7 +353,19 @@ sd_alloc_ctl_domain_table(struct sched_domain *sd) sizeof(long), 0644, proc_doulongvec_minmax, false); set_table_entry(&table[12], "name", sd->name, CORENAME_MAX_SIZE, 0444, proc_dostring, false); - /* &table[13] is terminator */ + sg = sd->groups; + if (sg->sge) { + char buf[32]; + struct ctl_table *entry = &table[13]; + + do { + snprintf(buf, 32, "group%d", i); + entry->procname = kstrdup(buf, GFP_KERNEL); + entry->mode = 0555; + entry->child = sd_alloc_ctl_group_table(sg); + } while (entry++, i++, sg = sg->next, sg != sd->groups); + } + /* &table[nr_entries-1] is terminator */ return table; }
diff --git a/kernel/sched/energy.c b/kernel/sched/energy.c new file mode 100644 index 0000000..e82248a --- /dev/null +++ b/kernel/sched/energy.c
@@ -0,0 +1,145 @@ +/* + * Obtain energy cost data from DT and populate relevant scheduler data + * structures. + * + * Copyright (C) 2015 ARM Ltd. + * + * This program is free software; you can redistribute it and/or modify + * it under the terms of the GNU General Public License version 2 as + * published by the Free Software Foundation. + * + * This program is distributed in the hope that it will be useful, + * but WITHOUT ANY WARRANTY; without even the implied warranty of + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the + * GNU General Public License for more details. + * + * You should have received a copy of the GNU General Public License + * along with this program. If not, see <http://www.gnu.org/licenses/>. + */ +#define pr_fmt(fmt) "sched-energy: " fmt + +#define DEBUG + +#include <linux/gfp.h> +#include <linux/of.h> +#include <linux/printk.h> +#include <linux/sched.h> +#include <linux/sched/topology.h> +#include <linux/sched_energy.h> +#include <linux/stddef.h> +#include <linux/arch_topology.h> + +struct sched_group_energy *sge_array[NR_CPUS][NR_SD_LEVELS]; + +static void free_resources(void) +{ + int cpu, sd_level; + struct sched_group_energy *sge; + + for_each_possible_cpu(cpu) { + for_each_possible_sd_level(sd_level) { + sge = sge_array[cpu][sd_level]; + if (sge) { + kfree(sge->cap_states); + kfree(sge->idle_states); + kfree(sge); + } + } + } +} + +static inline unsigned long cpu_max_capacity(int cpu) +{ + if (!sge_array[cpu][0]->cap_states) + return 1024; + if (!sge_array[cpu][0]->nr_cap_states) + return 1024; + + return sge_array[cpu][0]->cap_states[sge_array[cpu][0]->nr_cap_states-1].cap; +} + +int sched_energy_installed(int cpu) +{ + return (sge_array[cpu][0]->cap_states != NULL); +} + +void init_sched_energy_costs(void) +{ + struct device_node *cn, *cp; + struct capacity_state *cap_states; + struct idle_state *idle_states; + struct sched_group_energy *sge; + const struct property *prop; + int sd_level, i, nstates, cpu; + const __be32 *val; + + for_each_possible_cpu(cpu) { + cn = of_get_cpu_node(cpu, NULL); + if (!cn) { + pr_warn("CPU device node missing for CPU %d\n", cpu); + return; + } + + if (!of_find_property(cn, "sched-energy-costs", NULL)) { + pr_warn("CPU device node has no sched-energy-costs\n"); + return; + } + + for_each_possible_sd_level(sd_level) { + cp = of_parse_phandle(cn, "sched-energy-costs", sd_level); + if (!cp) + break; + + prop = of_find_property(cp, "busy-cost-data", NULL); + if (!prop || !prop->value) { + pr_warn("No busy-cost data, skipping sched_energy init\n"); + goto out; + } + + sge = kcalloc(1, sizeof(struct sched_group_energy), + GFP_NOWAIT); + + nstates = (prop->length / sizeof(u32)) / 2; + cap_states = kcalloc(nstates, + sizeof(struct capacity_state), + GFP_NOWAIT); + + for (i = 0, val = prop->value; i < nstates; i++) { + cap_states[i].cap = be32_to_cpup(val++); + cap_states[i].power = be32_to_cpup(val++); + } + + sge->nr_cap_states = nstates; + sge->cap_states = cap_states; + + prop = of_find_property(cp, "idle-cost-data", NULL); + if (!prop || !prop->value) { + pr_warn("No idle-cost data, skipping sched_energy init\n"); + goto out; + } + + nstates = (prop->length / sizeof(u32)); + idle_states = kcalloc(nstates, + sizeof(struct idle_state), + GFP_NOWAIT); + + for (i = 0, val = prop->value; i < nstates; i++) + idle_states[i].power = be32_to_cpup(val++); + + sge->nr_idle_states = nstates; + sge->idle_states = idle_states; + + sge_array[cpu][sd_level] = sge; + + /* populate cpu scale so that flags get set correctly */ + if (sd_level == 0) + topology_set_cpu_scale(cpu, cpu_max_capacity(cpu)); + } + } + + pr_info("Sched-energy-costs installed from DT\n"); + return; + +out: + free_resources(); +}
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c index d9be9d8..541f4826 100644 --- a/kernel/sched/fair.c +++ b/kernel/sched/fair.c
@@ -37,6 +37,8 @@ #include <trace/events/sched.h> #include "sched.h" +#include "tune.h" +#include "walt.h" /* * Targeted preemption latency for CPU-bound tasks: @@ -55,6 +57,15 @@ unsigned int sysctl_sched_latency = 6000000ULL; unsigned int normalized_sysctl_sched_latency = 6000000ULL; /* + * Enable/disable honoring sync flag in energy-aware wakeups. + */ +unsigned int sysctl_sched_sync_hint_enable = 1; +/* + * Enable/disable using cstate knowledge in idle sibling selection + */ +unsigned int sysctl_sched_cstate_aware = 1; + +/* * The initial- and re-scaling of tunables is configurable * * Options are: @@ -100,6 +111,13 @@ unsigned int normalized_sysctl_sched_wakeup_granularity = 1000000UL; const_debug unsigned int sysctl_sched_migration_cost = 500000UL; +#ifdef CONFIG_SCHED_WALT +unsigned int sysctl_sched_use_walt_cpu_util = 1; +unsigned int sysctl_sched_use_walt_task_util = 1; +__read_mostly unsigned int sysctl_sched_walt_cpu_high_irqload = + (10 * NSEC_PER_MSEC); +#endif + #ifdef CONFIG_SMP /* * For asym packing, by default the lower numbered cpu has higher priority. @@ -1431,7 +1449,6 @@ bool should_numa_migrate_memory(struct task_struct *p, struct page * page, static unsigned long weighted_cpuload(struct rq *rq); static unsigned long source_load(int cpu, int type); static unsigned long target_load(int cpu, int type); -static unsigned long capacity_of(int cpu); /* Cached statistics for all CPUs within a node */ struct numa_stats { @@ -2812,6 +2829,9 @@ static inline void cfs_rq_util_change(struct cfs_rq *cfs_rq) * See cpu_util(). */ cpufreq_update_util(rq, 0); +#ifdef CONFIG_SMP + trace_sched_load_avg_cpu(cpu_of(rq), cfs_rq); +#endif } } @@ -2968,7 +2988,8 @@ accumulate_sum(u64 delta, int cpu, struct sched_avg *sa, */ static __always_inline int ___update_load_avg(u64 now, int cpu, struct sched_avg *sa, - unsigned long weight, int running, struct cfs_rq *cfs_rq) + unsigned long weight, int running, struct cfs_rq *cfs_rq, + struct rt_rq *rt_rq) { u64 delta; @@ -3024,13 +3045,22 @@ ___update_load_avg(u64 now, int cpu, struct sched_avg *sa, } sa->util_avg = sa->util_sum / (LOAD_AVG_MAX - 1024 + sa->period_contrib); + if (cfs_rq) + trace_sched_load_cfs_rq(cfs_rq); + else { + if (likely(!rt_rq)) + trace_sched_load_se(container_of(sa, struct sched_entity, avg)); + else + trace_sched_load_rt_rq(cpu, rt_rq); + } + return 1; } static int __update_load_avg_blocked_se(u64 now, int cpu, struct sched_entity *se) { - return ___update_load_avg(now, cpu, &se->avg, 0, 0, NULL); + return ___update_load_avg(now, cpu, &se->avg, 0, 0, NULL, NULL); } static int @@ -3038,7 +3068,7 @@ __update_load_avg_se(u64 now, int cpu, struct cfs_rq *cfs_rq, struct sched_entit { return ___update_load_avg(now, cpu, &se->avg, se->on_rq * scale_load_down(se->load.weight), - cfs_rq->curr == se, NULL); + cfs_rq->curr == se, NULL, NULL); } static int @@ -3046,7 +3076,7 @@ __update_load_avg_cfs_rq(u64 now, int cpu, struct cfs_rq *cfs_rq) { return ___update_load_avg(now, cpu, &cfs_rq->avg, scale_load_down(cfs_rq->load.weight), - cfs_rq->curr != NULL, cfs_rq); + cfs_rq->curr != NULL, cfs_rq, NULL); } /* @@ -3099,6 +3129,8 @@ static inline void update_tg_load_avg(struct cfs_rq *cfs_rq, int force) atomic_long_add(delta, &cfs_rq->tg->load_avg); cfs_rq->tg_load_avg_contrib = cfs_rq->avg.load_avg; } + + trace_sched_load_tg(cfs_rq); } /* @@ -3267,6 +3299,9 @@ static inline int propagate_entity_load_avg(struct sched_entity *se) update_tg_cfs_util(cfs_rq, se); update_tg_cfs_load(cfs_rq, se); + trace_sched_load_cfs_rq(cfs_rq); + trace_sched_load_se(se); + return 1; } @@ -3381,6 +3416,21 @@ update_cfs_rq_load_avg(u64 now, struct cfs_rq *cfs_rq) return decayed || removed_load; } +int update_rt_rq_load_avg(u64 now, int cpu, struct rt_rq *rt_rq, int running) +{ + int ret; + + ret = ___update_load_avg(now, cpu, &rt_rq->avg, 0, running, NULL, rt_rq); + + return ret; +} + +unsigned long sched_get_rt_rq_util(int cpu) +{ + struct rt_rq *rt_rq = &(cpu_rq(cpu)->rt); + return rt_rq->avg.util_avg; +} + /* * Optional action to be done while updating the load average */ @@ -3428,6 +3478,8 @@ static void attach_entity_load_avg(struct cfs_rq *cfs_rq, struct sched_entity *s set_tg_cfs_propagate(cfs_rq); cfs_rq_util_change(cfs_rq); + + trace_sched_load_cfs_rq(cfs_rq); } /** @@ -3448,6 +3500,8 @@ static void detach_entity_load_avg(struct cfs_rq *cfs_rq, struct sched_entity *s set_tg_cfs_propagate(cfs_rq); cfs_rq_util_change(cfs_rq); + + trace_sched_load_cfs_rq(cfs_rq); } /* Add the load generated by se into cfs_rq's load average */ @@ -3552,6 +3606,11 @@ update_cfs_rq_load_avg(u64 now, struct cfs_rq *cfs_rq) return 0; } +int update_rt_rq_load_avg(u64 now, int cpu, struct rt_rq *rt_rq, int running) +{ + return 0; +} + #define UPDATE_TG 0x0 #define SKIP_AGE_LOAD 0x0 @@ -4871,6 +4930,48 @@ static inline void hrtick_update(struct rq *rq) } #endif +#ifdef CONFIG_SMP +static bool cpu_overutilized(int cpu); + +static unsigned long cpu_util(int cpu); + +static bool sd_overutilized(struct sched_domain *sd) +{ + return sd->shared->overutilized; +} + +static void set_sd_overutilized(struct sched_domain *sd) +{ + trace_sched_overutilized(sd, sd->shared->overutilized, true); + sd->shared->overutilized = true; +} + +static void clear_sd_overutilized(struct sched_domain *sd) +{ + trace_sched_overutilized(sd, sd->shared->overutilized, false); + sd->shared->overutilized = false; +} + +static inline void update_overutilized_status(struct rq *rq) +{ + struct sched_domain *sd; + + rcu_read_lock(); + sd = rcu_dereference(rq->sd); + if (sd && !sd_overutilized(sd) && + cpu_overutilized(rq->cpu)) + set_sd_overutilized(sd); + rcu_read_unlock(); +} + +unsigned long boosted_cpu_util(int cpu); +#else + +#define update_overutilized_status(rq) do {} while (0) +#define boosted_cpu_util(cpu) cpu_util_freq(cpu) + +#endif /* CONFIG_SMP */ + /* * The enqueue_task method is called before nr_running is * increased. Here we update the fair scheduling stats and @@ -4881,6 +4982,7 @@ enqueue_task_fair(struct rq *rq, struct task_struct *p, int flags) { struct cfs_rq *cfs_rq; struct sched_entity *se = &p->se; + int task_new = !(flags & ENQUEUE_WAKEUP); /* * If in_iowait is set, the code below may not trigger any cpufreq @@ -4905,6 +5007,7 @@ enqueue_task_fair(struct rq *rq, struct task_struct *p, int flags) if (cfs_rq_throttled(cfs_rq)) break; cfs_rq->h_nr_running++; + walt_inc_cfs_cumulative_runnable_avg(cfs_rq, p); flags = ENQUEUE_WAKEUP; } @@ -4912,6 +5015,7 @@ enqueue_task_fair(struct rq *rq, struct task_struct *p, int flags) for_each_sched_entity(se) { cfs_rq = cfs_rq_of(se); cfs_rq->h_nr_running++; + walt_inc_cfs_cumulative_runnable_avg(cfs_rq, p); if (cfs_rq_throttled(cfs_rq)) break; @@ -4920,8 +5024,31 @@ enqueue_task_fair(struct rq *rq, struct task_struct *p, int flags) update_cfs_shares(se); } - if (!se) + /* + * Update SchedTune accounting. + * + * We do it before updating the CPU capacity to ensure the + * boost value of the current task is accounted for in the + * selection of the OPP. + * + * We do it also in the case where we enqueue a throttled task; + * we could argue that a throttled task should not boost a CPU, + * however: + * a) properly implementing CPU boosting considering throttled + * tasks will increase a lot the complexity of the solution + * b) it's not easy to quantify the benefits introduced by + * such a more complex solution. + * Thus, for the time being we go for the simple solution and boost + * also for throttled RQs. + */ + schedtune_enqueue_task(p, cpu_of(rq)); + + if (!se) { add_nr_running(rq, 1); + if (!task_new) + update_overutilized_status(rq); + walt_inc_cumulative_runnable_avg(rq, p); + } hrtick_update(rq); } @@ -4952,6 +5079,7 @@ static void dequeue_task_fair(struct rq *rq, struct task_struct *p, int flags) if (cfs_rq_throttled(cfs_rq)) break; cfs_rq->h_nr_running--; + walt_dec_cfs_cumulative_runnable_avg(cfs_rq, p); /* Don't dequeue parent if it has other entities besides us */ if (cfs_rq->load.weight) { @@ -4971,6 +5099,7 @@ static void dequeue_task_fair(struct rq *rq, struct task_struct *p, int flags) for_each_sched_entity(se) { cfs_rq = cfs_rq_of(se); cfs_rq->h_nr_running--; + walt_dec_cfs_cumulative_runnable_avg(cfs_rq, p); if (cfs_rq_throttled(cfs_rq)) break; @@ -4979,8 +5108,19 @@ static void dequeue_task_fair(struct rq *rq, struct task_struct *p, int flags) update_cfs_shares(se); } - if (!se) + /* + * Update SchedTune accounting + * + * We do it before updating the CPU capacity to ensure the + * boost value of the current task is accounted for in the + * selection of the OPP. + */ + schedtune_dequeue_task(p, cpu_of(rq)); + + if (!se) { sub_nr_running(rq, 1); + walt_dec_cumulative_runnable_avg(rq, p); + } hrtick_update(rq); } @@ -5289,16 +5429,6 @@ static unsigned long target_load(int cpu, int type) return max(rq->cpu_load[type-1], total); } -static unsigned long capacity_of(int cpu) -{ - return cpu_rq(cpu)->cpu_capacity; -} - -static unsigned long capacity_orig_of(int cpu) -{ - return cpu_rq(cpu)->cpu_capacity_orig; -} - static unsigned long cpu_avg_load_per_task(int cpu) { struct rq *rq = cpu_rq(cpu); @@ -5329,6 +5459,644 @@ static void record_wakee(struct task_struct *p) } /* + * Returns the current capacity of cpu after applying both + * cpu and freq scaling. + */ +unsigned long capacity_curr_of(int cpu) +{ + unsigned long max_cap = cpu_rq(cpu)->cpu_capacity_orig; + unsigned long scale_freq = arch_scale_freq_capacity(NULL, cpu); + + return cap_scale(max_cap, scale_freq); +} + +static inline bool energy_aware(void) +{ + return sched_feat(ENERGY_AWARE); +} + +static int cpu_util_wake(int cpu, struct task_struct *p); + +/* + * __cpu_norm_util() returns the cpu util relative to a specific capacity, + * i.e. it's busy ratio, in the range [0..SCHED_CAPACITY_SCALE] which is useful + * for energy calculations. Using the scale-invariant util returned by + * cpu_util() and approximating scale-invariant util by: + * + * util ~ (curr_freq/max_freq)*1024 * capacity_orig/1024 * running_time/time + * + * the normalized util can be found using the specific capacity. + * + * capacity = capacity_orig * curr_freq/max_freq + * + * norm_util = running_time/time ~ util/capacity + */ +static unsigned long __cpu_norm_util(unsigned long util, unsigned long capacity) +{ + if (util >= capacity) + return SCHED_CAPACITY_SCALE; + + return (util << SCHED_CAPACITY_SHIFT)/capacity; +} + +/* + * CPU candidates. + * + * These are labels to reference CPU candidates for an energy_diff. + * Currently we support only two possible candidates: the task's previous CPU + * and another candiate CPU. + * More advanced/aggressive EAS selection policies can consider more + * candidates. + */ +#define EAS_CPU_PRV 0 +#define EAS_CPU_NXT 1 +#define EAS_CPU_BKP 2 + +/* + * energy_diff - supports the computation of the estimated energy impact in + * moving a "task"'s "util_delta" between different CPU candidates. + */ +/* + * NOTE: When using or examining WALT task signals, all wakeup + * latency is included as busy time for task util. + * + * This is relevant here because: + * When debugging is enabled, it can take as much as 1ms to + * write the output to the trace buffer for each eenv + * scenario. For periodic tasks where the sleep time is of + * a similar order, the WALT task util can be inflated. + * + * Further, and even without debugging enabled, + * task wakeup latency changes depending upon the EAS + * wakeup algorithm selected - FIND_BEST_TARGET only does + * energy calculations for up to 2 candidate CPUs. When + * NO_FIND_BEST_TARGET is configured, we can potentially + * do an energy calculation across all CPUS in the system. + * + * The impact to WALT task util on a Juno board + * running a periodic task which only sleeps for 200usec + * between 1ms activations has been measured. + * (i.e. the wakeup latency induced by energy calculation + * and debug output is double the desired sleep time and + * almost equivalent to the runtime which is more-or-less + * the worst case possible for this test) + * + * In this scenario, a task which has a PELT util of around + * 220 is inflated under WALT to have util around 400. + * + * This is simply a property of the way WALT includes + * wakeup latency in busy time while PELT does not. + * + * Hence - be careful when enabling DEBUG_EENV_DECISIONS + * expecially if WALT is the task signal. + */ +/*#define DEBUG_EENV_DECISIONS*/ + +#ifdef DEBUG_EENV_DECISIONS +/* max of 8 levels of sched groups traversed */ +#define EAS_EENV_DEBUG_LEVELS 16 + +struct _eenv_debug { + unsigned long cap; + unsigned long norm_util; + unsigned long cap_energy; + unsigned long idle_energy; + unsigned long this_energy; + unsigned long this_busy_energy; + unsigned long this_idle_energy; + cpumask_t group_cpumask; + unsigned long cpu_util[1]; +}; +#endif + +struct eenv_cpu { + /* CPU ID, must be in cpus_mask */ + int cpu_id; + + /* + * Index (into sched_group_energy::cap_states) of the OPP the + * CPU needs to run at if the task is placed on it. + * This includes the both active and blocked load, due to + * other tasks on this CPU, as well as the task's own + * utilization. + */ + int cap_idx; + int cap; + + /* Estimated system energy */ + unsigned long energy; + + /* Estimated energy variation wrt EAS_CPU_PRV */ + long nrg_delta; + +#ifdef DEBUG_EENV_DECISIONS + struct _eenv_debug *debug; + int debug_idx; +#endif /* DEBUG_EENV_DECISIONS */ +}; + +struct energy_env { + /* Utilization to move */ + struct task_struct *p; + unsigned long util_delta; + unsigned long util_delta_boosted; + + /* Mask of CPUs candidates to evaluate */ + cpumask_t cpus_mask; + + /* CPU candidates to evaluate */ + struct eenv_cpu *cpu; + int eenv_cpu_count; + +#ifdef DEBUG_EENV_DECISIONS + /* pointer to the memory block reserved + * for debug on this CPU - there will be + * sizeof(struct _eenv_debug) * + * (EAS_CPU_CNT * EAS_EENV_DEBUG_LEVELS) + * bytes allocated here. + */ + struct _eenv_debug *debug; +#endif + /* + * Index (into energy_env::cpu) of the morst energy efficient CPU for + * the specified energy_env::task + */ + int next_idx; + int max_cpu_count; + + /* Support data */ + struct sched_group *sg_top; + struct sched_group *sg_cap; + struct sched_group *sg; +}; + +static int cpu_util_wake(int cpu, struct task_struct *p); + +static unsigned long group_max_util(struct energy_env *eenv, int cpu_idx) +{ + unsigned long max_util = 0; + unsigned long util; + int cpu; + + for_each_cpu(cpu, sched_group_span(eenv->sg_cap)) { + util = cpu_util_wake(cpu, eenv->p); + + /* + * If we are looking at the target CPU specified by the eenv, + * then we should add the (estimated) utilization of the task + * assuming we will wake it up on that CPU. + */ + if (unlikely(cpu == eenv->cpu[cpu_idx].cpu_id)) + util += eenv->util_delta_boosted; + + max_util = max(max_util, util); + } + + return max_util; +} + +/* + * group_norm_util() returns the approximated group util relative to it's + * current capacity (busy ratio) in the range [0..SCHED_CAPACITY_SCALE] for use + * in energy calculations. Since task executions may or may not overlap in time + * in the group the true normalized util is between max(cpu_norm_util(i)) and + * sum(cpu_norm_util(i)) when iterating over all cpus in the group, i. The + * latter is used as the estimate as it leads to a more pessimistic energy + * estimate (more busy). + */ +static unsigned +long group_norm_util(struct energy_env *eenv, int cpu_idx) +{ + unsigned long capacity = eenv->cpu[cpu_idx].cap; + unsigned long util, util_sum = 0; + int cpu; + + for_each_cpu(cpu, sched_group_span(eenv->sg)) { + util = cpu_util_wake(cpu, eenv->p); + + /* + * If we are looking at the target CPU specified by the eenv, + * then we should add the (estimated) utilization of the task + * assuming we will wake it up on that CPU. + */ + if (unlikely(cpu == eenv->cpu[cpu_idx].cpu_id)) + util += eenv->util_delta; + + util_sum += __cpu_norm_util(util, capacity); + } + + if (util_sum > SCHED_CAPACITY_SCALE) + return SCHED_CAPACITY_SCALE; + return util_sum; +} + +static int find_new_capacity(struct energy_env *eenv, int cpu_idx) +{ + const struct sched_group_energy *sge = eenv->sg_cap->sge; + unsigned long util = group_max_util(eenv, cpu_idx); + int idx, cap_idx; + + cap_idx = sge->nr_cap_states - 1; + + for (idx = 0; idx < sge->nr_cap_states; idx++) { + if (sge->cap_states[idx].cap >= util) { + cap_idx = idx; + break; + } + } + /* Keep track of SG's capacity */ + eenv->cpu[cpu_idx].cap = sge->cap_states[cap_idx].cap; + eenv->cpu[cpu_idx].cap_idx = cap_idx; + + return cap_idx; +} + +static int group_idle_state(struct energy_env *eenv, int cpu_idx) +{ + struct sched_group *sg = eenv->sg; + int src_in_grp, dst_in_grp; + int i, state = INT_MAX; + int max_idle_state_idx; + long grp_util = 0; + int new_state; + + /* Find the shallowest idle state in the sched group. */ + for_each_cpu(i, sched_group_span(sg)) + state = min(state, idle_get_state_idx(cpu_rq(i))); + + /* Take non-cpuidle idling into account (active idle/arch_cpu_idle()) */ + state++; + /* + * Try to estimate if a deeper idle state is + * achievable when we move the task. + */ + for_each_cpu(i, sched_group_span(sg)) + grp_util += cpu_util(i); + + src_in_grp = cpumask_test_cpu(eenv->cpu[EAS_CPU_PRV].cpu_id, + sched_group_span(sg)); + dst_in_grp = cpumask_test_cpu(eenv->cpu[cpu_idx].cpu_id, + sched_group_span(sg)); + if (src_in_grp == dst_in_grp) { + /* + * both CPUs under consideration are in the same group or not in + * either group, migration should leave idle state the same. + */ + return state; + } + /* + * add or remove util as appropriate to indicate what group util + * will be (worst case - no concurrent execution) after moving the task + */ + grp_util += src_in_grp ? -eenv->util_delta : eenv->util_delta; + + if (grp_util > + ((long)sg->sgc->max_capacity * (int)sg->group_weight)) { + /* + * After moving, the group will be fully occupied + * so assume it will not be idle at all. + */ + return 0; + } + + /* + * after moving, this group is at most partly + * occupied, so it should have some idle time. + */ + max_idle_state_idx = sg->sge->nr_idle_states - 2; + new_state = grp_util * max_idle_state_idx; + if (grp_util <= 0) { + /* group will have no util, use lowest state */ + new_state = max_idle_state_idx + 1; + } else { + /* + * for partially idle, linearly map util to idle + * states, excluding the lowest one. This does not + * correspond to the state we expect to enter in + * reality, but an indication of what might happen. + */ + new_state = min_t(int, max_idle_state_idx, + new_state / sg->sgc->max_capacity); + new_state = max_idle_state_idx - new_state; + } + return new_state; +} + +#ifdef DEBUG_EENV_DECISIONS +static struct _eenv_debug *eenv_debug_entry_ptr(struct _eenv_debug *base, int idx); + +static void store_energy_calc_debug_info(struct energy_env *eenv, int cpu_idx, int cap_idx, int idle_idx) +{ + int debug_idx = eenv->cpu[cpu_idx].debug_idx; + unsigned long sg_util, busy_energy, idle_energy; + const struct sched_group_energy *sge; + struct _eenv_debug *dbg; + int cpu; + + if (debug_idx < EAS_EENV_DEBUG_LEVELS) { + sge = eenv->sg->sge; + sg_util = group_norm_util(eenv, cpu_idx); + busy_energy = sge->cap_states[cap_idx].power; + busy_energy *= sg_util; + idle_energy = SCHED_CAPACITY_SCALE - sg_util; + idle_energy *= sge->idle_states[idle_idx].power; + /* should we use sg_cap or sg? */ + dbg = eenv_debug_entry_ptr(eenv->cpu[cpu_idx].debug, debug_idx); + dbg->cap = sge->cap_states[cap_idx].cap; + dbg->norm_util = sg_util; + dbg->cap_energy = sge->cap_states[cap_idx].power; + dbg->idle_energy = sge->idle_states[idle_idx].power; + dbg->this_energy = busy_energy + idle_energy; + dbg->this_busy_energy = busy_energy; + dbg->this_idle_energy = idle_energy; + + cpumask_copy(&dbg->group_cpumask, + sched_group_span(eenv->sg)); + + for_each_cpu(cpu, &dbg->group_cpumask) + dbg->cpu_util[cpu] = cpu_util(cpu); + + eenv->cpu[cpu_idx].debug_idx = debug_idx+1; + } +} +#else +#define store_energy_calc_debug_info(a,b,c,d) {} +#endif /* DEBUG_EENV_DECISIONS */ + +/* + * calc_sg_energy: compute energy for the eenv's SG (i.e. eenv->sg). + * + * This works in iterations to compute the SG's energy for each CPU + * candidate defined by the energy_env's cpu array. + */ +static void calc_sg_energy(struct energy_env *eenv) +{ + struct sched_group *sg = eenv->sg; + unsigned long busy_energy, idle_energy; + unsigned int busy_power, idle_power; + unsigned long total_energy = 0; + unsigned long sg_util; + int cap_idx, idle_idx; + int cpu_idx; + + for (cpu_idx = EAS_CPU_PRV; cpu_idx < eenv->max_cpu_count; ++cpu_idx) { + if (eenv->cpu[cpu_idx].cpu_id == -1) + continue; + + /* Compute ACTIVE energy */ + cap_idx = find_new_capacity(eenv, cpu_idx); + busy_power = sg->sge->cap_states[cap_idx].power; + sg_util = group_norm_util(eenv, cpu_idx); + busy_energy = sg_util * busy_power; + + /* Compute IDLE energy */ + idle_idx = group_idle_state(eenv, cpu_idx); + idle_power = sg->sge->idle_states[idle_idx].power; + idle_energy = SCHED_CAPACITY_SCALE - sg_util; + idle_energy *= idle_power; + + total_energy = busy_energy + idle_energy; + eenv->cpu[cpu_idx].energy += total_energy; + + store_energy_calc_debug_info(eenv, cpu_idx, cap_idx, idle_idx); + } +} + +/* + * compute_energy() computes the absolute variation in energy consumption by + * moving eenv.util_delta from EAS_CPU_PRV to EAS_CPU_NXT. + * + * NOTE: compute_energy() may fail when racing with sched_domain updates, in + * which case we abort by returning -EINVAL. + */ +static int compute_energy(struct energy_env *eenv) +{ + struct sched_domain *sd; + int cpu; + struct cpumask visit_cpus; + struct sched_group *sg; + + WARN_ON(!eenv->sg_top->sge); + + cpumask_copy(&visit_cpus, sched_group_span(eenv->sg_top)); + + while (!cpumask_empty(&visit_cpus)) { + struct sched_group *sg_shared_cap = NULL; + + cpu = cpumask_first(&visit_cpus); + + /* + * Is the group utilization affected by cpus outside this + * sched_group? + */ + sd = rcu_dereference(per_cpu(sd_scs, cpu)); + if (sd && sd->parent) + sg_shared_cap = sd->parent->groups; + + for_each_domain(cpu, sd) { + sg = sd->groups; + + /* Has this sched_domain already been visited? */ + if (sd->child && group_first_cpu(sg) != cpu) + break; + + do { + eenv->sg_cap = sg; + if (sg_shared_cap && sg_shared_cap->group_weight >= sg->group_weight) + eenv->sg_cap = sg_shared_cap; + + /* + * Compute the energy for all the candidate + * CPUs in the current visited SG. + */ + eenv->sg = sg; + calc_sg_energy(eenv); + + /* remove CPUs we have just visited */ + if (!sd->child) + cpumask_xor(&visit_cpus, &visit_cpus, sched_group_span(sg)); + + if (cpumask_equal(sched_group_span(sg), sched_group_span(eenv->sg_top))) + goto next_cpu; + + } while (sg = sg->next, sg != sd->groups); + } +next_cpu: + continue; + } + + return 0; +} + +static inline bool cpu_in_sg(struct sched_group *sg, int cpu) +{ + return cpu != -1 && cpumask_test_cpu(cpu, sched_group_span(sg)); +} + +#ifdef DEBUG_EENV_DECISIONS +static void dump_eenv_debug(struct energy_env *eenv) +{ + int cpu_idx, grp_idx; + char cpu_utils[(NR_CPUS*12)+10]="cpu_util: "; + char cpulist[64]; + + trace_printk("eenv scenario: task=%p %s task_util=%lu prev_cpu=%d", + eenv->p, eenv->p->comm, eenv->util_delta, eenv->cpu[EAS_CPU_PRV].cpu_id); + + for (cpu_idx=EAS_CPU_PRV; cpu_idx < eenv->max_cpu_count; cpu_idx++) { + if (eenv->cpu[cpu_idx].cpu_id == -1) + continue; + trace_printk("---Scenario %d: Place task on cpu %d energy=%lu (%d debug logs at %p)", + cpu_idx+1, eenv->cpu[cpu_idx].cpu_id, + eenv->cpu[cpu_idx].energy >> SCHED_CAPACITY_SHIFT, + eenv->cpu[cpu_idx].debug_idx, + eenv->cpu[cpu_idx].debug); + for (grp_idx = 0; grp_idx < eenv->cpu[cpu_idx].debug_idx; grp_idx++) { + struct _eenv_debug *debug; + int cpu, written=0; + + debug = eenv_debug_entry_ptr(eenv->cpu[cpu_idx].debug, grp_idx); + cpu = scnprintf(cpulist, sizeof(cpulist), "%*pbl", cpumask_pr_args(&debug->group_cpumask)); + + cpu_utils[0] = 0; + /* print out the relevant cpu_util */ + for_each_cpu(cpu, &(debug->group_cpumask)) { + char tmp[64]; + if (written > sizeof(cpu_utils)-10) { + cpu_utils[written]=0; + break; + } + written += snprintf(tmp, sizeof(tmp), "cpu%d(%lu) ", cpu, debug->cpu_util[cpu]); + strcat(cpu_utils, tmp); + } + /* trace the data */ + trace_printk(" | %s : cap=%lu nutil=%lu, cap_nrg=%lu, idle_nrg=%lu energy=%lu busy_energy=%lu idle_energy=%lu %s", + cpulist, debug->cap, debug->norm_util, + debug->cap_energy, debug->idle_energy, + debug->this_energy >> SCHED_CAPACITY_SHIFT, + debug->this_busy_energy >> SCHED_CAPACITY_SHIFT, + debug->this_idle_energy >> SCHED_CAPACITY_SHIFT, + cpu_utils); + + } + trace_printk("---"); + } + trace_printk("----- done"); + return; +} +#else +#define dump_eenv_debug(a) {} +#endif /* DEBUG_EENV_DECISIONS */ +/* + * select_energy_cpu_idx(): estimate the energy impact of changing the + * utilization distribution. + * + * The eenv parameter specifies the changes: utilization amount and a + * collection of possible CPU candidates. The number of candidates + * depends upon the selection algorithm used. + * + * If find_best_target was used to select candidate CPUs, there will + * be at most 3 including prev_cpu. If not, we used a brute force + * selection which will provide the union of: + * * CPUs belonging to the highest sd which is not overutilized + * * CPUs the task is allowed to run on + * * online CPUs + * + * This function returns the index of a CPU candidate specified by the + * energy_env which corresponds to the most energy efficient CPU. + * Thus, 0 (EAS_CPU_PRV) means that non of the CPU candidate is more energy + * efficient than running on prev_cpu. This is also the value returned in case + * of abort due to error conditions during the computations. The only + * exception to this if we fail to access the energy model via sd_ea, where + * we return -1 with the intent of asking the system to use a different + * wakeup placement algorithm. + * + * A value greater than zero means that the most energy efficient CPU is the + * one represented by eenv->cpu[eenv->next_idx].cpu_id. + */ +static inline int select_energy_cpu_idx(struct energy_env *eenv) +{ + int last_cpu_idx = eenv->max_cpu_count - 1; + struct sched_domain *sd; + struct sched_group *sg; + int sd_cpu = -1; + int cpu_idx; + int margin; + + sd_cpu = eenv->cpu[EAS_CPU_PRV].cpu_id; + sd = rcu_dereference(per_cpu(sd_ea, sd_cpu)); + if (!sd) + return -1; + + cpumask_clear(&eenv->cpus_mask); + for (cpu_idx = EAS_CPU_PRV; cpu_idx < eenv->max_cpu_count; ++cpu_idx) { + int cpu = eenv->cpu[cpu_idx].cpu_id; + + if (cpu < 0) + continue; + cpumask_set_cpu(cpu, &eenv->cpus_mask); + } + + sg = sd->groups; + do { + /* Skip SGs which do not contains a candidate CPU */ + if (!cpumask_intersects(&eenv->cpus_mask, sched_group_span(sg))) + continue; + + eenv->sg_top = sg; + if (compute_energy(eenv) == -EINVAL) + return EAS_CPU_PRV; + } while (sg = sg->next, sg != sd->groups); + /* remember - eenv energy values are unscaled */ + + /* + * Compute the dead-zone margin used to prevent too many task + * migrations with negligible energy savings. + * An energy saving is considered meaningful if it reduces the energy + * consumption of EAS_CPU_PRV CPU candidate by at least ~1.56% + */ + margin = eenv->cpu[EAS_CPU_PRV].energy >> 6; + + /* + * By default the EAS_CPU_PRV CPU is considered the most energy + * efficient, with a 0 energy variation. + */ + eenv->next_idx = EAS_CPU_PRV; + eenv->cpu[EAS_CPU_PRV].nrg_delta = 0; + + dump_eenv_debug(eenv); + + /* + * Compare the other CPU candidates to find a CPU which can be + * more energy efficient then EAS_CPU_PRV + */ + if (sched_feat(FBT_STRICT_ORDER)) + last_cpu_idx = EAS_CPU_BKP; + + for(cpu_idx = EAS_CPU_NXT; cpu_idx <= last_cpu_idx; cpu_idx++) { + if (eenv->cpu[cpu_idx].cpu_id < 0) + continue; + eenv->cpu[cpu_idx].nrg_delta = + eenv->cpu[cpu_idx].energy - + eenv->cpu[EAS_CPU_PRV].energy; + + /* filter energy variations within the dead-zone margin */ + if (abs(eenv->cpu[cpu_idx].nrg_delta) < margin) + eenv->cpu[cpu_idx].nrg_delta = 0; + /* update the schedule candidate with min(nrg_delta) */ + if (eenv->cpu[cpu_idx].nrg_delta < + eenv->cpu[eenv->next_idx].nrg_delta) { + eenv->next_idx = cpu_idx; + /* break out if we want to stop on first saving candidate */ + if (sched_feat(FBT_STRICT_ORDER)) + break; + } + } + + return eenv->next_idx; +} + +/* * Detect M:N waker/wakee relationships via a switching-frequency heuristic. * * A waker of many should wake a different task than the one last awakened @@ -5345,15 +6113,18 @@ static void record_wakee(struct task_struct *p) * whatever is irrelevant, spread criteria is apparent partner count exceeds * socket size. */ -static int wake_wide(struct task_struct *p) +static int wake_wide(struct task_struct *p, int sibling_count_hint) { unsigned int master = current->wakee_flips; unsigned int slave = p->wakee_flips; - int factor = this_cpu_read(sd_llc_size); + int llc_size = this_cpu_read(sd_llc_size); + + if (sibling_count_hint >= llc_size) + return 1; if (master < slave) swap(master, slave); - if (slave < factor || master < slave * factor) + if (slave < llc_size || master < slave * llc_size) return 0; return 1; } @@ -5439,8 +6210,101 @@ static int wake_affine(struct sched_domain *sd, struct task_struct *p, return affine; } -static inline int task_util(struct task_struct *p); -static int cpu_util_wake(int cpu, struct task_struct *p); +static inline unsigned long task_util(struct task_struct *p); + +#ifdef CONFIG_SCHED_TUNE +struct reciprocal_value schedtune_spc_rdiv; + +static long +schedtune_margin(unsigned long signal, long boost) +{ + long long margin = 0; + + /* + * Signal proportional compensation (SPC) + * + * The Boost (B) value is used to compute a Margin (M) which is + * proportional to the complement of the original Signal (S): + * M = B * (SCHED_CAPACITY_SCALE - S) + * The obtained M could be used by the caller to "boost" S. + */ + if (boost >= 0) { + margin = SCHED_CAPACITY_SCALE - signal; + margin *= boost; + } else + margin = -signal * boost; + + margin = reciprocal_divide(margin, schedtune_spc_rdiv); + + if (boost < 0) + margin *= -1; + return margin; +} + +static inline int +schedtune_cpu_margin(unsigned long util, int cpu) +{ + int boost = schedtune_cpu_boost(cpu); + + if (boost == 0) + return 0; + + return schedtune_margin(util, boost); +} + +static inline long +schedtune_task_margin(struct task_struct *task) +{ + int boost = schedtune_task_boost(task); + unsigned long util; + long margin; + + if (boost == 0) + return 0; + + util = task_util(task); + margin = schedtune_margin(util, boost); + + return margin; +} + +#else /* CONFIG_SCHED_TUNE */ + +static inline int +schedtune_cpu_margin(unsigned long util, int cpu) +{ + return 0; +} + +static inline int +schedtune_task_margin(struct task_struct *task) +{ + return 0; +} + +#endif /* CONFIG_SCHED_TUNE */ + +unsigned long +boosted_cpu_util(int cpu) +{ + unsigned long util = cpu_util_freq(cpu); + long margin = schedtune_cpu_margin(util, cpu); + + trace_sched_boost_cpu(cpu, util, margin); + + return util + margin; +} + +static inline unsigned long +boosted_task_util(struct task_struct *task) +{ + unsigned long util = task_util(task); + long margin = schedtune_task_margin(task); + + trace_sched_boost_task(task, util, margin); + + return util + margin; +} static unsigned long capacity_spare_wake(int cpu, struct task_struct *p) { @@ -5450,6 +6314,8 @@ static unsigned long capacity_spare_wake(int cpu, struct task_struct *p) /* * find_idlest_group finds and returns the least busy CPU group within the * domain. + * + * Assumes p is allowed on at least one CPU in sd. */ static struct sched_group * find_idlest_group(struct sched_domain *sd, struct task_struct *p, @@ -5457,8 +6323,9 @@ find_idlest_group(struct sched_domain *sd, struct task_struct *p, { struct sched_group *idlest = NULL, *group = sd->groups; struct sched_group *most_spare_sg = NULL; - unsigned long min_runnable_load = ULONG_MAX, this_runnable_load = 0; - unsigned long min_avg_load = ULONG_MAX, this_avg_load = 0; + unsigned long min_runnable_load = ULONG_MAX; + unsigned long this_runnable_load = ULONG_MAX; + unsigned long min_avg_load = ULONG_MAX, this_avg_load = ULONG_MAX; unsigned long most_spare = 0, this_spare = 0; int load_idx = sd->forkexec_idx; int imbalance_scale = 100 + (sd->imbalance_pct-100)/2; @@ -5579,10 +6446,10 @@ find_idlest_group(struct sched_domain *sd, struct task_struct *p, } /* - * find_idlest_cpu - find the idlest cpu among the cpus in group. + * find_idlest_group_cpu - find the idlest cpu among the cpus in group. */ static int -find_idlest_cpu(struct sched_group *group, struct task_struct *p, int this_cpu) +find_idlest_group_cpu(struct sched_group *group, struct task_struct *p, int this_cpu) { unsigned long load, min_load = ULONG_MAX; unsigned int min_exit_latency = UINT_MAX; @@ -5631,6 +6498,53 @@ find_idlest_cpu(struct sched_group *group, struct task_struct *p, int this_cpu) return shallowest_idle_cpu != -1 ? shallowest_idle_cpu : least_loaded_cpu; } +static inline int find_idlest_cpu(struct sched_domain *sd, struct task_struct *p, + int cpu, int prev_cpu, int sd_flag) +{ + int new_cpu = cpu; + + if (!cpumask_intersects(sched_domain_span(sd), &p->cpus_allowed)) + return prev_cpu; + + while (sd) { + struct sched_group *group; + struct sched_domain *tmp; + int weight; + + if (!(sd->flags & sd_flag)) { + sd = sd->child; + continue; + } + + group = find_idlest_group(sd, p, cpu, sd_flag); + if (!group) { + sd = sd->child; + continue; + } + + new_cpu = find_idlest_group_cpu(group, p, cpu); + if (new_cpu == cpu) { + /* Now try balancing at a lower domain level of cpu */ + sd = sd->child; + continue; + } + + /* Now try balancing at a lower domain level of new_cpu */ + cpu = new_cpu; + weight = sd->span_weight; + sd = NULL; + for_each_domain(cpu, tmp) { + if (weight <= tmp->span_weight) + break; + if (tmp->flags & sd_flag) + sd = tmp; + } + /* while loop will break here if sd == NULL */ + } + + return new_cpu; +} + #ifdef CONFIG_SCHED_SMT static inline void set_idle_cores(int cpu, int val) @@ -5812,7 +6726,7 @@ static int select_idle_cpu(struct task_struct *p, struct sched_domain *sd, int t /* * Try and locate an idle core/thread in the LLC cache domain. */ -static int select_idle_sibling(struct task_struct *p, int prev, int target) +static inline int __select_idle_sibling(struct task_struct *p, int prev, int target) { struct sched_domain *sd; int i; @@ -5845,42 +6759,86 @@ static int select_idle_sibling(struct task_struct *p, int prev, int target) return target; } -/* - * cpu_util returns the amount of capacity of a CPU that is used by CFS - * tasks. The unit of the return value must be the one of capacity so we can - * compare the utilization with the capacity of the CPU that is available for - * CFS task (ie cpu_capacity). - * - * cfs_rq.avg.util_avg is the sum of running time of runnable tasks plus the - * recent utilization of currently non-runnable tasks on a CPU. It represents - * the amount of utilization of a CPU in the range [0..capacity_orig] where - * capacity_orig is the cpu_capacity available at the highest frequency - * (arch_scale_freq_capacity()). - * The utilization of a CPU converges towards a sum equal to or less than the - * current capacity (capacity_curr <= capacity_orig) of the CPU because it is - * the running time on this CPU scaled by capacity_curr. - * - * Nevertheless, cfs_rq.avg.util_avg can be higher than capacity_curr or even - * higher than capacity_orig because of unfortunate rounding in - * cfs.avg.util_avg or just after migrating tasks and new task wakeups until - * the average stabilizes with the new running time. We need to check that the - * utilization stays within the range of [0..capacity_orig] and cap it if - * necessary. Without utilization capping, a group could be seen as overloaded - * (CPU0 utilization at 121% + CPU1 utilization at 80%) whereas CPU1 has 20% of - * available capacity. We allow utilization to overshoot capacity_curr (but not - * capacity_orig) as it useful for predicting the capacity required after task - * migrations (scheduler-driven DVFS). - */ -static int cpu_util(int cpu) +static inline int select_idle_sibling_cstate_aware(struct task_struct *p, int prev, int target) { - unsigned long util = cpu_rq(cpu)->cfs.avg.util_avg; - unsigned long capacity = capacity_orig_of(cpu); + struct sched_domain *sd; + struct sched_group *sg; + int best_idle_cpu = -1; + int best_idle_cstate = -1; + int best_idle_capacity = INT_MAX; + int i; - return (util >= capacity) ? capacity : util; + /* + * Iterate the domains and find an elegible idle cpu. + */ + sd = rcu_dereference(per_cpu(sd_llc, target)); + for_each_lower_domain(sd) { + sg = sd->groups; + do { + if (!cpumask_intersects( + sched_group_span(sg), &p->cpus_allowed)) + goto next; + + for_each_cpu_and(i, &p->cpus_allowed, sched_group_span(sg)) { + int idle_idx; + unsigned long new_usage; + unsigned long capacity_orig; + + if (!idle_cpu(i)) + goto next; + + /* figure out if the task can fit here at all */ + new_usage = boosted_task_util(p); + capacity_orig = capacity_orig_of(i); + + if (new_usage > capacity_orig) + goto next; + + /* if the task fits without changing OPP and we + * intended to use this CPU, just proceed + */ + if (i == target && new_usage <= capacity_curr_of(target)) { + return target; + } + + /* otherwise select CPU with shallowest idle state + * to reduce wakeup latency. + */ + idle_idx = idle_get_state_idx(cpu_rq(i)); + + if (idle_idx < best_idle_cstate && + capacity_orig <= best_idle_capacity) { + best_idle_cpu = i; + best_idle_cstate = idle_idx; + best_idle_capacity = capacity_orig; + } + } + next: + sg = sg->next; + } while (sg != sd->groups); + } + + if (best_idle_cpu >= 0) + target = best_idle_cpu; + + return target; } -static inline int task_util(struct task_struct *p) +static int select_idle_sibling(struct task_struct *p, int prev, int target) { + if (!sysctl_sched_cstate_aware) + return __select_idle_sibling(p, prev, target); + + return select_idle_sibling_cstate_aware(p, prev, target); +} + +static inline unsigned long task_util(struct task_struct *p) +{ +#ifdef CONFIG_SCHED_WALT + if (!walt_disabled && sysctl_sched_use_walt_task_util) { + return (p->ravg.demand / (walt_ravg_window >> SCHED_CAPACITY_SHIFT)); + } +#endif return p->se.avg.util_avg; } @@ -5892,6 +6850,16 @@ static int cpu_util_wake(int cpu, struct task_struct *p) { unsigned long util, capacity; +#ifdef CONFIG_SCHED_WALT + /* + * WALT does not decay idle tasks in the same manner + * as PELT, so it makes little sense to subtract task + * utilization from cpu utilization. Instead just use + * cpu_util for this case. + */ + if (!walt_disabled && sysctl_sched_use_walt_cpu_util) + return cpu_util(cpu); +#endif /* Task has no contribution or is new */ if (cpu != task_cpu(p) || !p->se.avg.last_update_time) return cpu_util(cpu); @@ -5902,6 +6870,291 @@ static int cpu_util_wake(int cpu, struct task_struct *p) return (util >= capacity) ? capacity : util; } +static inline int task_fits_capacity(struct task_struct *p, long capacity) +{ + return capacity * 1024 > boosted_task_util(p) * capacity_margin; +} + +static int start_cpu(bool boosted) +{ + struct root_domain *rd = cpu_rq(smp_processor_id())->rd; + + return boosted ? rd->max_cap_orig_cpu : rd->min_cap_orig_cpu; +} + +static inline int find_best_target(struct task_struct *p, int *backup_cpu, + bool boosted, bool prefer_idle) +{ + unsigned long best_idle_min_cap_orig = ULONG_MAX; + unsigned long min_util = boosted_task_util(p); + unsigned long target_capacity = ULONG_MAX; + unsigned long min_wake_util = ULONG_MAX; + unsigned long target_max_spare_cap = 0; + unsigned long target_util = ULONG_MAX; + unsigned long best_active_util = ULONG_MAX; + int best_idle_cstate = INT_MAX; + struct sched_domain *sd; + struct sched_group *sg; + int best_active_cpu = -1; + int best_idle_cpu = -1; + int target_cpu = -1; + int cpu, i; + + *backup_cpu = -1; + + /* Find start CPU based on boost value */ + cpu = start_cpu(boosted); + if (cpu < 0) + return -1; + + /* Find SD for the start CPU */ + sd = rcu_dereference(per_cpu(sd_ea, cpu)); + if (!sd) + return -1; + + /* Scan CPUs in all SDs */ + sg = sd->groups; + do { + for_each_cpu_and(i, &p->cpus_allowed, sched_group_span(sg)) { + unsigned long capacity_curr = capacity_curr_of(i); + unsigned long capacity_orig = capacity_orig_of(i); + unsigned long wake_util, new_util; + + if (!cpu_online(i)) + continue; + + if (walt_cpu_high_irqload(i)) + continue; + + /* + * p's blocked utilization is still accounted for on prev_cpu + * so prev_cpu will receive a negative bias due to the double + * accounting. However, the blocked utilization may be zero. + */ + wake_util = cpu_util_wake(i, p); + new_util = wake_util + task_util(p); + + /* + * Ensure minimum capacity to grant the required boost. + * The target CPU can be already at a capacity level higher + * than the one required to boost the task. + */ + new_util = max(min_util, new_util); + if (new_util > capacity_orig) + continue; + + /* + * Case A) Latency sensitive tasks + * + * Unconditionally favoring tasks that prefer idle CPU to + * improve latency. + * + * Looking for: + * - an idle CPU, whatever its idle_state is, since + * the first CPUs we explore are more likely to be + * reserved for latency sensitive tasks. + * - a non idle CPU where the task fits in its current + * capacity and has the maximum spare capacity. + * - a non idle CPU with lower contention from other + * tasks and running at the lowest possible OPP. + * + * The last two goals tries to favor a non idle CPU + * where the task can run as if it is "almost alone". + * A maximum spare capacity CPU is favoured since + * the task already fits into that CPU's capacity + * without waiting for an OPP chance. + * + * The following code path is the only one in the CPUs + * exploration loop which is always used by + * prefer_idle tasks. It exits the loop with wither a + * best_active_cpu or a target_cpu which should + * represent an optimal choice for latency sensitive + * tasks. + */ + if (prefer_idle) { + + /* + * Case A.1: IDLE CPU + * Return the first IDLE CPU we find. + */ + if (idle_cpu(i)) { + trace_sched_find_best_target(p, + prefer_idle, min_util, + cpu, best_idle_cpu, + best_active_cpu, i); + + return i; + } + + /* + * Case A.2: Target ACTIVE CPU + * Favor CPUs with max spare capacity. + */ + if ((capacity_curr > new_util) && + (capacity_orig - new_util > target_max_spare_cap)) { + target_max_spare_cap = capacity_orig - new_util; + target_cpu = i; + continue; + } + if (target_cpu != -1) + continue; + + + /* + * Case A.3: Backup ACTIVE CPU + * Favor CPUs with: + * - lower utilization due to other tasks + * - lower utilization with the task in + */ + if (wake_util > min_wake_util) + continue; + if (new_util > best_active_util) + continue; + min_wake_util = wake_util; + best_active_util = new_util; + best_active_cpu = i; + continue; + } + + /* + * Enforce EAS mode + * + * For non latency sensitive tasks, skip CPUs that + * will be overutilized by moving the task there. + * + * The goal here is to remain in EAS mode as long as + * possible at least for !prefer_idle tasks. + */ + if ((new_util * capacity_margin) > + (capacity_orig * SCHED_CAPACITY_SCALE)) + continue; + + /* + * Case B) Non latency sensitive tasks on IDLE CPUs. + * + * Find an optimal backup IDLE CPU for non latency + * sensitive tasks. + * + * Looking for: + * - minimizing the capacity_orig, + * i.e. preferring LITTLE CPUs + * - favoring shallowest idle states + * i.e. avoid to wakeup deep-idle CPUs + * + * The following code path is used by non latency + * sensitive tasks if IDLE CPUs are available. If at + * least one of such CPUs are available it sets the + * best_idle_cpu to the most suitable idle CPU to be + * selected. + * + * If idle CPUs are available, favour these CPUs to + * improve performances by spreading tasks. + * Indeed, the energy_diff() computed by the caller + * will take care to ensure the minimization of energy + * consumptions without affecting performance. + */ + if (idle_cpu(i)) { + int idle_idx = idle_get_state_idx(cpu_rq(i)); + + /* Select idle CPU with lower cap_orig */ + if (capacity_orig > best_idle_min_cap_orig) + continue; + + /* + * Skip CPUs in deeper idle state, but only + * if they are also less energy efficient. + * IOW, prefer a deep IDLE LITTLE CPU vs a + * shallow idle big CPU. + */ + if (sysctl_sched_cstate_aware && + best_idle_cstate <= idle_idx) + continue; + + /* Keep track of best idle CPU */ + best_idle_min_cap_orig = capacity_orig; + best_idle_cstate = idle_idx; + best_idle_cpu = i; + continue; + } + + /* + * Case C) Non latency sensitive tasks on ACTIVE CPUs. + * + * Pack tasks in the most energy efficient capacities. + * + * This task packing strategy prefers more energy + * efficient CPUs (i.e. pack on smaller maximum + * capacity CPUs) while also trying to spread tasks to + * run them all at the lower OPP. + * + * This assumes for example that it's more energy + * efficient to run two tasks on two CPUs at a lower + * OPP than packing both on a single CPU but running + * that CPU at an higher OPP. + * + * Thus, this case keep track of the CPU with the + * smallest maximum capacity and highest spare maximum + * capacity. + */ + + /* Favor CPUs with smaller capacity */ + if (capacity_orig > target_capacity) + continue; + + /* Favor CPUs with maximum spare capacity */ + if ((capacity_orig - new_util) < target_max_spare_cap) + continue; + + target_max_spare_cap = capacity_orig - new_util; + target_capacity = capacity_orig; + target_util = new_util; + target_cpu = i; + } + + } while (sg = sg->next, sg != sd->groups); + + /* + * For non latency sensitive tasks, cases B and C in the previous loop, + * we pick the best IDLE CPU only if we was not able to find a target + * ACTIVE CPU. + * + * Policies priorities: + * + * - prefer_idle tasks: + * + * a) IDLE CPU available, we return immediately + * b) ACTIVE CPU where task fits and has the bigger maximum spare + * capacity (i.e. target_cpu) + * c) ACTIVE CPU with less contention due to other tasks + * (i.e. best_active_cpu) + * + * - NON prefer_idle tasks: + * + * a) ACTIVE CPU: target_cpu + * b) IDLE CPU: best_idle_cpu + */ + if (target_cpu == -1) + target_cpu = prefer_idle + ? best_active_cpu + : best_idle_cpu; + else + *backup_cpu = prefer_idle + ? best_active_cpu + : best_idle_cpu; + + trace_sched_find_best_target(p, prefer_idle, min_util, cpu, + best_idle_cpu, best_active_cpu, + target_cpu); + + /* it is possible for target and backup + * to select same CPU - if so, drop backup + */ + if (*backup_cpu == target_cpu) + *backup_cpu = -1; + + return target_cpu; +} + /* * Disable WAKE_AFFINE in the case where task @p doesn't fit in the * capacity of either the waking CPU @cpu or the previous CPU @prev_cpu. @@ -5923,7 +7176,288 @@ static int wake_cap(struct task_struct *p, int cpu, int prev_cpu) /* Bring task utilization in sync with prev_cpu */ sync_entity_load_avg(&p->se); - return min_cap * 1024 < task_util(p) * capacity_margin; + return task_fits_capacity(p, min_cap); +} + +static bool cpu_overutilized(int cpu) +{ + return (capacity_of(cpu) * 1024) < (cpu_util(cpu) * capacity_margin); +} + +DEFINE_PER_CPU(struct energy_env, eenv_cache); + +/* kernels often have NR_CPUS defined to be much + * larger than exist in practise on booted systems. + * Allocate the cpu array for eenv calculations + * at boot time to avoid massive overprovisioning. + */ +#ifdef DEBUG_EENV_DECISIONS +static inline int eenv_debug_size_per_dbg_entry(void) +{ + return sizeof(struct _eenv_debug) + (sizeof(unsigned long) * num_possible_cpus()); +} + +static inline int eenv_debug_size_per_cpu_entry(void) +{ + /* each cpu struct has an array of _eenv_debug structs + * which have an array of unsigned longs at the end - + * the allocation should be extended so that there are + * at least 'num_possible_cpus' entries in the array. + */ + return EAS_EENV_DEBUG_LEVELS * eenv_debug_size_per_dbg_entry(); +} +/* given a per-_eenv_cpu debug env ptr, get the ptr for a given index */ +static inline struct _eenv_debug *eenv_debug_entry_ptr(struct _eenv_debug *base, int idx) +{ + char *ptr = (char *)base; + ptr += (idx * eenv_debug_size_per_dbg_entry()); + return (struct _eenv_debug *)ptr; +} +/* given a pointer to the per-cpu global copy of _eenv_debug, get + * a pointer to the specified _eenv_cpu debug env. + */ +static inline struct _eenv_debug *eenv_debug_percpu_debug_env_ptr(struct _eenv_debug *base, int cpu_idx) +{ + char *ptr = (char *)base; + ptr += (cpu_idx * eenv_debug_size_per_cpu_entry()); + return (struct _eenv_debug *)ptr; +} + +static inline int eenv_debug_size(void) +{ + return num_possible_cpus() * eenv_debug_size_per_cpu_entry(); +} +#endif + +static inline void alloc_eenv(void) +{ + int cpu; + int cpu_count = num_possible_cpus(); + + for_each_possible_cpu(cpu) { + struct energy_env *eenv = &per_cpu(eenv_cache, cpu); + eenv->cpu = kmalloc(sizeof(struct eenv_cpu) * cpu_count, GFP_KERNEL); + eenv->eenv_cpu_count = cpu_count; +#ifdef DEBUG_EENV_DECISIONS + eenv->debug = (struct _eenv_debug *)kmalloc(eenv_debug_size(), GFP_KERNEL); +#endif + } +} + +static inline void reset_eenv(struct energy_env *eenv) +{ + int cpu_count; + struct eenv_cpu *cpu; +#ifdef DEBUG_EENV_DECISIONS + struct _eenv_debug *debug; + int cpu_idx; + debug = eenv->debug; +#endif + + cpu_count = eenv->eenv_cpu_count; + cpu = eenv->cpu; + memset(eenv, 0, sizeof(struct energy_env)); + eenv->cpu = cpu; + memset(eenv->cpu, 0, sizeof(struct eenv_cpu)*cpu_count); + eenv->eenv_cpu_count = cpu_count; + +#ifdef DEBUG_EENV_DECISIONS + memset(debug, 0, eenv_debug_size()); + eenv->debug = debug; + for(cpu_idx = 0; cpu_idx < eenv->cpu_array_len; cpu_idx++) + eenv->cpu[cpu_idx].debug = eenv_debug_percpu_debug_env_ptr(debug, cpu_idx); +#endif +} +/* + * get_eenv - reset the eenv struct cached for this CPU + * + * When the eenv is returned, it is configured to do + * energy calculations for the maximum number of CPUs + * the task can be placed on. The prev_cpu entry is + * filled in here. Callers are responsible for adding + * other CPU candidates up to eenv->max_cpu_count. + */ +static inline struct energy_env *get_eenv(struct task_struct *p, int prev_cpu) +{ + struct energy_env *eenv; + cpumask_t cpumask_possible_cpus; + int cpu = smp_processor_id(); + int i; + + eenv = &(per_cpu(eenv_cache, cpu)); + reset_eenv(eenv); + + /* populate eenv */ + eenv->p = p; + /* use boosted task util for capacity selection + * during energy calculation, but unboosted task + * util for group utilization calculations + */ + eenv->util_delta = task_util(p); + eenv->util_delta_boosted = boosted_task_util(p); + + cpumask_and(&cpumask_possible_cpus, &p->cpus_allowed, cpu_online_mask); + eenv->max_cpu_count = cpumask_weight(&cpumask_possible_cpus); + + for (i=0; i < eenv->max_cpu_count; i++) + eenv->cpu[i].cpu_id = -1; + eenv->cpu[EAS_CPU_PRV].cpu_id = prev_cpu; + eenv->next_idx = EAS_CPU_PRV; + + return eenv; +} + +/* + * Needs to be called inside rcu_read_lock critical section. + * sd is a pointer to the sched domain we wish to use for an + * energy-aware placement option. + */ +static int find_energy_efficient_cpu(struct sched_domain *sd, + struct task_struct *p, + int cpu, int prev_cpu, + int sync) +{ + int use_fbt = sched_feat(FIND_BEST_TARGET); + int cpu_iter, eas_cpu_idx = EAS_CPU_NXT; + int energy_cpu = -1; + struct energy_env *eenv; + + if (sysctl_sched_sync_hint_enable && sync) { + if (cpumask_test_cpu(cpu, &p->cpus_allowed)) { + return cpu; + } + } + + /* prepopulate energy diff environment */ + eenv = get_eenv(p, prev_cpu); + if (eenv->max_cpu_count < 2) + return energy_cpu; + + if(!use_fbt) { + /* + * using this function outside wakeup balance will not supply + * an sd ptr. Instead, fetch the highest level with energy data. + */ + if (!sd) + sd = rcu_dereference(per_cpu(sd_ea, prev_cpu)); + + for_each_cpu_and(cpu_iter, &p->cpus_allowed, sched_domain_span(sd)) { + unsigned long spare; + + /* prev_cpu already in list */ + if (cpu_iter == prev_cpu) + continue; + + spare = capacity_spare_wake(cpu_iter, p); + + if (spare * 1024 < capacity_margin * task_util(p)) + continue; + + /* Add CPU candidate */ + eenv->cpu[eas_cpu_idx++].cpu_id = cpu_iter; + eenv->max_cpu_count = eas_cpu_idx; + + /* stop adding CPUs if we have no space left */ + if (eas_cpu_idx >= eenv->eenv_cpu_count) + break; + } + } else { + int boosted = (schedtune_task_boost(p) > 0); + int prefer_idle; + + /* + * give compiler a hint that if sched_features + * cannot be changed, it is safe to optimise out + * all if(prefer_idle) blocks. + */ + prefer_idle = sched_feat(EAS_PREFER_IDLE) ? + (schedtune_prefer_idle(p) > 0) : 0; + + eenv->max_cpu_count = EAS_CPU_BKP + 1; + + /* Find a cpu with sufficient capacity */ + eenv->cpu[EAS_CPU_NXT].cpu_id = find_best_target(p, + &eenv->cpu[EAS_CPU_BKP].cpu_id, + boosted, prefer_idle); + + /* take note if no backup was found */ + if (eenv->cpu[EAS_CPU_BKP].cpu_id < 0) + eenv->max_cpu_count = EAS_CPU_BKP; + + /* take note if no target was found */ + if (eenv->cpu[EAS_CPU_NXT].cpu_id < 0) + eenv->max_cpu_count = EAS_CPU_NXT; + } + + if (eenv->max_cpu_count == EAS_CPU_NXT) { + /* + * we did not find any energy-awareness + * candidates beyond prev_cpu, so we will + * fall-back to the regular slow-path. + */ + return energy_cpu; + } + + /* find most energy-efficient CPU */ + energy_cpu = select_energy_cpu_idx(eenv) < 0 ? -1 : + eenv->cpu[eenv->next_idx].cpu_id; + + return energy_cpu; +} + +static inline bool nohz_kick_needed(struct rq *rq, bool only_update); +static void nohz_balancer_kick(bool only_update); + +/* + * wake_energy: Make the decision if we want to use an energy-aware + * wakeup task placement or not. This is limited to situations where + * we cannot use energy-awareness right now. + * + * Returns TRUE if we should attempt energy-aware wakeup, FALSE if not. + * + * Should only be called from select_task_rq_fair inside the RCU + * read-side critical section. + */ +static inline int wake_energy(struct task_struct *p, int prev_cpu, + int sd_flag, int wake_flags) +{ + struct sched_domain *sd = NULL; + int sync = wake_flags & WF_SYNC; + + sd = rcu_dereference_sched(cpu_rq(prev_cpu)->sd); + + /* + * Check all definite no-energy-awareness conditions + */ + if (!sd) + return false; + + if (!energy_aware()) + return false; + + if (sd_overutilized(sd)) + return false; + + /* + * we cannot do energy-aware wakeup placement sensibly + * for tasks with 0 utilization, so let them be placed + * according to the normal strategy. + * However if fbt is in use we may still benefit from + * the heuristics we use there in selecting candidate + * CPUs. + */ + if (unlikely(!sched_feat(FIND_BEST_TARGET) && !task_util(p))) + return false; + + if(!sched_feat(EAS_PREFER_IDLE)){ + /* + * Force prefer-idle tasks into the slow path, this may not happen + * if none of the sd flags matched. + */ + if (schedtune_prefer_idle(p) > 0 && !sync) + return false; + } + return true; } /* @@ -5939,21 +7473,28 @@ static int wake_cap(struct task_struct *p, int cpu, int prev_cpu) * preempt must be disabled. */ static int -select_task_rq_fair(struct task_struct *p, int prev_cpu, int sd_flag, int wake_flags) +select_task_rq_fair(struct task_struct *p, int prev_cpu, int sd_flag, int wake_flags, + int sibling_count_hint) { - struct sched_domain *tmp, *affine_sd = NULL, *sd = NULL; + struct sched_domain *tmp, *affine_sd = NULL; + struct sched_domain *sd = NULL, *energy_sd = NULL; int cpu = smp_processor_id(); int new_cpu = prev_cpu; int want_affine = 0; + int want_energy = 0; int sync = wake_flags & WF_SYNC; + rcu_read_lock(); + if (sd_flag & SD_BALANCE_WAKE) { record_wakee(p); - want_affine = !wake_wide(p) && !wake_cap(p, cpu, prev_cpu) - && cpumask_test_cpu(cpu, &p->cpus_allowed); + want_energy = wake_energy(p, prev_cpu, sd_flag, wake_flags); + want_affine = !want_energy && + !wake_wide(p, sibling_count_hint) && + !wake_cap(p, cpu, prev_cpu) && + cpumask_test_cpu(cpu, &p->cpus_allowed); } - rcu_read_lock(); for_each_domain(cpu, tmp) { if (!(tmp->flags & SD_LOAD_BALANCE)) break; @@ -5968,9 +7509,22 @@ select_task_rq_fair(struct task_struct *p, int prev_cpu, int sd_flag, int wake_f break; } + /* + * If we are able to try an energy-aware wakeup, + * select the highest non-overutilized sched domain + * which includes this cpu and prev_cpu + * + * maybe want to not test prev_cpu and only consider + * the current one? + */ + if (want_energy && + !sd_overutilized(tmp) && + cpumask_test_cpu(prev_cpu, sched_domain_span(tmp))) + energy_sd = tmp; + if (tmp->flags & sd_flag) sd = tmp; - else if (!want_affine) + else if (!(want_affine || want_energy)) break; } @@ -5983,47 +7537,38 @@ select_task_rq_fair(struct task_struct *p, int prev_cpu, int sd_flag, int wake_f new_cpu = cpu; } + if (sd && !(sd_flag & SD_BALANCE_FORK)) { + /* + * We're going to need the task's util for capacity_spare_wake + * in find_idlest_group. Sync it up to prev_cpu's + * last_update_time. + */ + sync_entity_load_avg(&p->se); + } + if (!sd) { - pick_cpu: +pick_cpu: if (sd_flag & SD_BALANCE_WAKE) /* XXX always ? */ new_cpu = select_idle_sibling(p, prev_cpu, new_cpu); - } else while (sd) { - struct sched_group *group; - int weight; + } else { + if (energy_sd) + new_cpu = find_energy_efficient_cpu(energy_sd, p, cpu, prev_cpu, sync); - if (!(sd->flags & sd_flag)) { - sd = sd->child; - continue; - } - - group = find_idlest_group(sd, p, cpu, sd_flag); - if (!group) { - sd = sd->child; - continue; - } - - new_cpu = find_idlest_cpu(group, p, cpu); - if (new_cpu == -1 || new_cpu == cpu) { - /* Now try balancing at a lower domain level of cpu */ - sd = sd->child; - continue; - } - - /* Now try balancing at a lower domain level of new_cpu */ - cpu = new_cpu; - weight = sd->span_weight; - sd = NULL; - for_each_domain(cpu, tmp) { - if (weight <= tmp->span_weight) - break; - if (tmp->flags & sd_flag) - sd = tmp; - } - /* while loop will break here if sd == NULL */ + /* if we did an energy-aware placement and had no choices available + * then fall back to the default find_idlest_cpu choice + */ + if (!energy_sd || (energy_sd && new_cpu == -1)) + new_cpu = find_idlest_cpu(sd, p, cpu, prev_cpu, sd_flag); } + rcu_read_unlock(); +#ifdef CONFIG_NO_HZ_COMMON + if (nohz_kick_needed(cpu_rq(new_cpu), true)) + nohz_balancer_kick(true); +#endif + return new_cpu; } @@ -6248,6 +7793,29 @@ static void check_preempt_wakeup(struct rq *rq, struct task_struct *p, int wake_ set_last_buddy(se); } +static inline void update_misfit_task(struct rq *rq, struct task_struct *p) +{ +#ifdef CONFIG_SMP + rq->misfit_task = !task_fits_capacity(p, capacity_of(rq->cpu)); +#endif +} + +static inline void clear_rq_misfit(struct rq *rq) +{ +#ifdef CONFIG_SMP + rq->misfit_task = 0; +#endif +} + +static inline unsigned int rq_has_misfit(struct rq *rq) +{ +#ifdef CONFIG_SMP + return rq->misfit_task; +#else + return 0; +#endif +} + static struct task_struct * pick_next_task_fair(struct rq *rq, struct task_struct *prev, struct rq_flags *rf) { @@ -6338,6 +7906,8 @@ pick_next_task_fair(struct rq *rq, struct task_struct *prev, struct rq_flags *rf if (hrtick_enabled(rq)) hrtick_start_fair(rq, p); + update_misfit_task(rq, p); + return p; simple: #endif @@ -6355,9 +7925,12 @@ pick_next_task_fair(struct rq *rq, struct task_struct *prev, struct rq_flags *rf if (hrtick_enabled(rq)) hrtick_start_fair(rq, p); + update_misfit_task(rq, p); + return p; idle: + clear_rq_misfit(rq); new_tasks = idle_balance(rq, rf); /* @@ -6563,6 +8136,13 @@ static unsigned long __read_mostly max_load_balance_interval = HZ/10; enum fbq_type { regular, remote, all }; +enum group_type { + group_other = 0, + group_misfit_task, + group_imbalanced, + group_overloaded, +}; + #define LBF_ALL_PINNED 0x01 #define LBF_NEED_BREAK 0x02 #define LBF_DST_PINNED 0x04 @@ -6581,6 +8161,7 @@ struct lb_env { int new_dst_cpu; enum cpu_idle_type idle; long imbalance; + unsigned int src_grp_nr_running; /* The set of CPUs under consideration for load-balancing */ struct cpumask *cpus; @@ -6591,6 +8172,7 @@ struct lb_env { unsigned int loop_max; enum fbq_type fbq_type; + enum group_type src_grp_type; struct list_head tasks; }; @@ -7004,6 +8586,10 @@ static void update_blocked_averages(int cpu) if (cfs_rq_is_decayed(cfs_rq)) list_del_leaf_cfs_rq(cfs_rq); } + update_rt_rq_load_avg(rq_clock_task(rq), cpu, &rq->rt, 0); +#ifdef CONFIG_NO_HZ_COMMON + rq->last_blocked_load_update_tick = jiffies; +#endif rq_unlock_irqrestore(rq, &rf); } @@ -7063,6 +8649,10 @@ static inline void update_blocked_averages(int cpu) rq_lock_irqsave(rq, &rf); update_rq_clock(rq); update_cfs_rq_load_avg(cfs_rq_clock_task(cfs_rq), cfs_rq); + update_rt_rq_load_avg(rq_clock_task(rq), cpu, &rq->rt, 0); +#ifdef CONFIG_NO_HZ_COMMON + rq->last_blocked_load_update_tick = jiffies; +#endif rq_unlock_irqrestore(rq, &rf); } @@ -7074,12 +8664,6 @@ static unsigned long task_h_load(struct task_struct *p) /********** Helpers for find_busiest_group ************************/ -enum group_type { - group_other = 0, - group_imbalanced, - group_overloaded, -}; - /* * sg_lb_stats - stats of a sched_group required for load_balancing */ @@ -7095,6 +8679,7 @@ struct sg_lb_stats { unsigned int group_weight; enum group_type group_type; int group_no_capacity; + int group_misfit_task; /* A cpu has a task too big for its capacity */ #ifdef CONFIG_NUMA_BALANCING unsigned int nr_numa_running; unsigned int nr_preferred_running; @@ -7111,6 +8696,7 @@ struct sd_lb_stats { unsigned long total_running; unsigned long total_load; /* Total load of all groups in sd */ unsigned long total_capacity; /* Total capacity of all groups in sd */ + unsigned long total_util; /* Total util of all groups in sd */ unsigned long avg_load; /* Average load across all groups in sd */ struct sg_lb_stats busiest_stat;/* Statistics of the busiest group */ @@ -7131,6 +8717,7 @@ static inline void init_sd_lb_stats(struct sd_lb_stats *sds) .total_running = 0UL, .total_load = 0UL, .total_capacity = 0UL, + .total_util = 0UL, .busiest_stat = { .avg_load = 0UL, .sum_nr_running = 0, @@ -7216,7 +8803,7 @@ void update_group_capacity(struct sched_domain *sd, int cpu) { struct sched_domain *child = sd->child; struct sched_group *group, *sdg = sd->groups; - unsigned long capacity, min_capacity; + unsigned long capacity, min_capacity, max_capacity; unsigned long interval; interval = msecs_to_jiffies(sd->balance_interval); @@ -7230,6 +8817,7 @@ void update_group_capacity(struct sched_domain *sd, int cpu) capacity = 0; min_capacity = ULONG_MAX; + max_capacity = 0; if (child->flags & SD_OVERLAP) { /* @@ -7260,6 +8848,7 @@ void update_group_capacity(struct sched_domain *sd, int cpu) } min_capacity = min(capacity, min_capacity); + max_capacity = max(capacity, max_capacity); } } else { /* @@ -7273,12 +8862,14 @@ void update_group_capacity(struct sched_domain *sd, int cpu) capacity += sgc->capacity; min_capacity = min(sgc->min_capacity, min_capacity); + max_capacity = max(sgc->max_capacity, max_capacity); group = group->next; } while (group != child->groups); } sdg->sgc->capacity = capacity; sdg->sgc->min_capacity = min_capacity; + sdg->sgc->max_capacity = max_capacity; } /* @@ -7384,6 +8975,19 @@ group_smaller_cpu_capacity(struct sched_group *sg, struct sched_group *ref) ref->sgc->min_capacity * 1024; } +/* + * group_similar_cpu_capacity: Returns true if the minimum capacity of the + * compared groups differ by less than 12.5%. + */ +static inline bool +group_similar_cpu_capacity(struct sched_group *sg, struct sched_group *ref) +{ + long diff = sg->sgc->min_capacity - ref->sgc->min_capacity; + long max = max(sg->sgc->min_capacity, ref->sgc->min_capacity); + + return abs(diff) < max >> 3; +} + static inline enum group_type group_classify(struct sched_group *group, struct sg_lb_stats *sgs) @@ -7394,6 +8998,9 @@ group_type group_classify(struct sched_group *group, if (sg_imbalanced(group)) return group_imbalanced; + if (sgs->group_misfit_task) + return group_misfit_task; + return group_other; } @@ -7405,11 +9012,12 @@ group_type group_classify(struct sched_group *group, * @local_group: Does group contain this_cpu. * @sgs: variable to hold the statistics for this group. * @overload: Indicate more than one runnable task for any CPU. + * @overutilized: Indicate overutilization for any CPU. */ static inline void update_sg_lb_stats(struct lb_env *env, struct sched_group *group, int load_idx, int local_group, struct sg_lb_stats *sgs, - bool *overload) + bool *overload, bool *overutilized, bool *misfit_task) { unsigned long load; int i, nr_running; @@ -7443,6 +9051,17 @@ static inline void update_sg_lb_stats(struct lb_env *env, */ if (!nr_running && idle_cpu(i)) sgs->idle_cpus++; + + if (env->sd->flags & SD_ASYM_CPUCAPACITY && + !sgs->group_misfit_task && rq_has_misfit(rq)) + sgs->group_misfit_task = capacity_of(i); + + if (cpu_overutilized(i)) { + *overutilized = true; + + if (rq_has_misfit(rq)) + *misfit_task = true; + } } /* Adjust by relative CPU capacity of the group */ @@ -7478,6 +9097,14 @@ static bool update_sd_pick_busiest(struct lb_env *env, { struct sg_lb_stats *busiest = &sds->busiest_stat; + /* + * Don't try to pull misfit tasks we can't help. + */ + if (sgs->group_type == group_misfit_task && + (!group_smaller_cpu_capacity(sg, sds->local) || + !group_has_capacity(env, &sds->local_stat))) + return false; + if (sgs->group_type > busiest->group_type) return true; @@ -7500,6 +9127,15 @@ static bool update_sd_pick_busiest(struct lb_env *env, group_smaller_cpu_capacity(sds->local, sg)) return false; + /* + * Candidate sg doesn't face any severe imbalance issues so + * don't disturb unless the groups are of similar capacity + * where balancing is more harmless. + */ + if (sgs->group_type == group_other && + !group_similar_cpu_capacity(sds->local, sg)) + return false; + asym_packing: /* This is the busiest node in its class. */ if (!(env->sd->flags & SD_ASYM_PACKING)) @@ -7557,6 +9193,18 @@ static inline enum fbq_type fbq_classify_rq(struct rq *rq) } #endif /* CONFIG_NUMA_BALANCING */ +#ifdef CONFIG_NO_HZ_COMMON +static struct { + cpumask_var_t idle_cpus_mask; + atomic_t nr_cpus; + unsigned long next_balance; /* in jiffy units */ + unsigned long next_update; /* in jiffy units */ +} nohz ____cacheline_aligned; +#endif + +#define lb_sd_parent(sd) \ + (sd->parent && sd->parent->groups != sd->parent->groups->next) + /** * update_sd_lb_stats - Update sched_domain's statistics for load balancing. * @env: The load balancing environment. @@ -7569,11 +9217,35 @@ static inline void update_sd_lb_stats(struct lb_env *env, struct sd_lb_stats *sd struct sg_lb_stats *local = &sds->local_stat; struct sg_lb_stats tmp_sgs; int load_idx, prefer_sibling = 0; - bool overload = false; + bool overload = false, overutilized = false, misfit_task = false; if (child && child->flags & SD_PREFER_SIBLING) prefer_sibling = 1; +#ifdef CONFIG_NO_HZ_COMMON + if (env->idle == CPU_NEWLY_IDLE) { + int cpu; + + /* Update the stats of NOHZ idle CPUs in the sd */ + for_each_cpu_and(cpu, sched_domain_span(env->sd), + nohz.idle_cpus_mask) { + struct rq *rq = cpu_rq(cpu); + + /* ... Unless we've already done since the last tick */ + if (time_after(jiffies, + rq->last_blocked_load_update_tick)) + update_blocked_averages(cpu); + } + } + /* + * If we've just updated all of the NOHZ idle CPUs, then we can push + * back the next nohz.next_update, which will prevent an unnecessary + * wakeup for the nohz stats kick + */ + if (cpumask_subset(nohz.idle_cpus_mask, sched_domain_span(env->sd))) + nohz.next_update = jiffies + LOAD_AVG_PERIOD; +#endif + load_idx = get_sd_load_idx(env->sd, env->idle); do { @@ -7591,7 +9263,8 @@ static inline void update_sd_lb_stats(struct lb_env *env, struct sd_lb_stats *sd } update_sg_lb_stats(env, sg, load_idx, local_group, sgs, - &overload); + &overload, &overutilized, + &misfit_task); if (local_group) goto next_group; @@ -7623,6 +9296,7 @@ static inline void update_sd_lb_stats(struct lb_env *env, struct sd_lb_stats *sd sds->total_running += sgs->sum_nr_running; sds->total_load += sgs->group_load; sds->total_capacity += sgs->group_capacity; + sds->total_util += sgs->group_util; sg = sg->next; } while (sg != env->sd->groups); @@ -7630,11 +9304,52 @@ static inline void update_sd_lb_stats(struct lb_env *env, struct sd_lb_stats *sd if (env->sd->flags & SD_NUMA) env->fbq_type = fbq_classify_group(&sds->busiest_stat); - if (!env->sd->parent) { + env->src_grp_nr_running = sds->busiest_stat.sum_nr_running; + + if (!lb_sd_parent(env->sd)) { /* update overload indicator if we are at root domain */ if (env->dst_rq->rd->overload != overload) env->dst_rq->rd->overload = overload; } + + if (overutilized) + set_sd_overutilized(env->sd); + else + clear_sd_overutilized(env->sd); + + /* + * If there is a misfit task in one cpu in this sched_domain + * it is likely that the imbalance cannot be sorted out among + * the cpu's in this sched_domain. In this case set the + * overutilized flag at the parent sched_domain. + */ + if (misfit_task) { + struct sched_domain *sd = env->sd->parent; + + /* + * In case of a misfit task, load balance at the parent + * sched domain level will make sense only if the the cpus + * have a different capacity. If cpus at a domain level have + * the same capacity, the misfit task cannot be well + * accomodated in any of the cpus and there in no point in + * trying a load balance at this level + */ + while (sd) { + if (sd->flags & SD_ASYM_CPUCAPACITY) { + set_sd_overutilized(sd); + break; + } + sd = sd->parent; + } + } + + /* + * If the domain util is greater that domain capacity, load balancing + * needs to be done at the next sched domain level as well. + */ + if (lb_sd_parent(env->sd) && + sds->total_capacity * 1024 < sds->total_util * capacity_margin) + set_sd_overutilized(env->sd->parent); } /** @@ -7783,8 +9498,9 @@ static inline void calculate_imbalance(struct lb_env *env, struct sd_lb_stats *s * factors in sg capacity and sgs with smaller group_type are * skipped when updating the busiest sg: */ - if (busiest->avg_load <= sds->avg_load || - local->avg_load >= sds->avg_load) { + if (busiest->group_type != group_misfit_task && + (busiest->avg_load <= sds->avg_load || + local->avg_load >= sds->avg_load)) { env->imbalance = 0; return fix_small_imbalance(env, sds); } @@ -7818,6 +9534,12 @@ static inline void calculate_imbalance(struct lb_env *env, struct sd_lb_stats *s (sds->avg_load - local->avg_load) * local->group_capacity ) / SCHED_CAPACITY_SCALE; + /* Boost imbalance to allow misfit task to be balanced. */ + if (busiest->group_type == group_misfit_task) { + env->imbalance = max_t(long, env->imbalance, + busiest->group_misfit_task); + } + /* * if *imbalance is less than the average load per runnable task * there is no guarantee that any tasks will be moved so we'll have @@ -7853,6 +9575,10 @@ static struct sched_group *find_busiest_group(struct lb_env *env) * this level. */ update_sd_lb_stats(env, &sds); + + if (energy_aware() && !sd_overutilized(env->sd)) + goto out_balanced; + local = &sds.local_stat; busiest = &sds.busiest_stat; @@ -7876,11 +9602,18 @@ static struct sched_group *find_busiest_group(struct lb_env *env) if (busiest->group_type == group_imbalanced) goto force_balance; - /* SD_BALANCE_NEWIDLE trumps SMP nice when underutilized */ - if (env->idle == CPU_NEWLY_IDLE && group_has_capacity(env, local) && + /* + * When dst_cpu is idle, prevent SMP nice and/or asymmetric group + * capacities from resulting in underutilization due to avg_load. + */ + if (env->idle != CPU_NOT_IDLE && group_has_capacity(env, local) && busiest->group_no_capacity) goto force_balance; + /* Misfitting tasks should be dealt with regardless of the avg load */ + if (busiest->group_type == group_misfit_task) + goto force_balance; + /* * If the local group is busier than the selected busiest group * don't try and pull any tasks. @@ -7918,6 +9651,7 @@ static struct sched_group *find_busiest_group(struct lb_env *env) force_balance: /* Looks like there is an imbalance. Compute it */ + env->src_grp_type = busiest->group_type; calculate_imbalance(env, &sds); return sds.busiest; @@ -7965,6 +9699,13 @@ static struct rq *find_busiest_queue(struct lb_env *env, if (rt > env->fbq_type) continue; + /* + * For ASYM_CPUCAPACITY domains with misfit tasks we ignore + * load. + */ + if (env->src_grp_type == group_misfit_task && rq_has_misfit(rq)) + return rq; + capacity = capacity_of(i); wl = weighted_cpuload(rq); @@ -8034,6 +9775,14 @@ static int need_active_balance(struct lb_env *env) return 1; } + if ((capacity_of(env->src_cpu) < capacity_of(env->dst_cpu)) && + ((capacity_orig_of(env->src_cpu) < capacity_orig_of(env->dst_cpu))) && + env->src_rq->cfs.h_nr_running == 1 && + cpu_overutilized(env->src_cpu) && + !cpu_overutilized(env->dst_cpu)) { + return 1; + } + return unlikely(sd->nr_balance_failed > sd->cache_nice_tries+2); } @@ -8086,7 +9835,7 @@ static int load_balance(int this_cpu, struct rq *this_rq, int *continue_balancing) { int ld_moved, cur_ld_moved, active_balance = 0; - struct sched_domain *sd_parent = sd->parent; + struct sched_domain *sd_parent = lb_sd_parent(sd) ? sd->parent : NULL; struct sched_group *group; struct rq *busiest; struct rq_flags rf; @@ -8252,7 +10001,8 @@ static int load_balance(int this_cpu, struct rq *this_rq, * excessive cache_hot migrations and active balances. */ if (idle != CPU_NEWLY_IDLE) - sd->nr_balance_failed++; + if (env.src_grp_nr_running > 1) + sd->nr_balance_failed++; if (need_active_balance(&env)) { unsigned long flags; @@ -8348,6 +10098,7 @@ static inline unsigned long get_sd_balance_interval(struct sched_domain *sd, int cpu_busy) { unsigned long interval = sd->balance_interval; + unsigned int cpu; if (cpu_busy) interval *= sd->busy_factor; @@ -8356,6 +10107,24 @@ get_sd_balance_interval(struct sched_domain *sd, int cpu_busy) interval = msecs_to_jiffies(interval); interval = clamp(interval, 1UL, max_load_balance_interval); + /* + * check if sched domain is marked as overutilized + * we ought to only do this on systems which have SD_ASYMCAPACITY + * but we want to do it for all sched domains in those systems + * So for now, just check if overutilized as a proxy. + */ + /* + * If we are overutilized and we have a misfit task, then + * we want to balance as soon as practically possible, so + * we return an interval of zero. + */ + if (energy_aware() && sd_overutilized(sd)) { + /* we know the root is overutilized, let's check for a misfit task */ + for_each_cpu(cpu, sched_domain_span(sd)) { + if (rq_has_misfit(cpu_rq(cpu))) + return 1; + } + } return interval; } @@ -8589,11 +10358,6 @@ static inline int on_null_domain(struct rq *rq) * needed, they will kick the idle load balancer, which then does idle * load balancing for all the idle CPUs. */ -static struct { - cpumask_var_t idle_cpus_mask; - atomic_t nr_cpus; - unsigned long next_balance; /* in jiffy units */ -} nohz ____cacheline_aligned; static inline int find_new_ilb(void) { @@ -8610,7 +10374,7 @@ static inline int find_new_ilb(void) * nohz_load_balancer CPU (if there is one) otherwise fallback to any idle * CPU (if there is one). */ -static void nohz_balancer_kick(void) +static void nohz_balancer_kick(bool only_update) { int ilb_cpu; @@ -8623,6 +10387,10 @@ static void nohz_balancer_kick(void) if (test_and_set_bit(NOHZ_BALANCE_KICK, nohz_flags(ilb_cpu))) return; + + if (only_update) + set_bit(NOHZ_STATS_KICK, nohz_flags(ilb_cpu)); + /* * Use smp_send_reschedule() instead of resched_cpu(). * This way we generate a sched IPI on the target cpu which @@ -8710,6 +10478,8 @@ void nohz_balance_enter_idle(int cpu) atomic_inc(&nohz.nr_cpus); set_bit(NOHZ_TICK_STOPPED, nohz_flags(cpu)); } +#else +static inline void nohz_balancer_kick(bool only_update) {} #endif static DEFINE_SPINLOCK(balancing); @@ -8741,8 +10511,6 @@ static void rebalance_domains(struct rq *rq, enum cpu_idle_type idle) int need_serialize, need_decay = 0; u64 max_cost = 0; - update_blocked_averages(cpu); - rcu_read_lock(); for_each_domain(cpu, sd) { /* @@ -8757,6 +10525,9 @@ static void rebalance_domains(struct rq *rq, enum cpu_idle_type idle) } max_cost += sd->max_newidle_lb_cost; + if (energy_aware() && !sd_overutilized(sd)) + continue; + if (!(sd->flags & SD_LOAD_BALANCE)) continue; @@ -8841,6 +10612,7 @@ static void nohz_idle_balance(struct rq *this_rq, enum cpu_idle_type idle) { int this_cpu = this_rq->cpu; struct rq *rq; + struct sched_domain *sd; int balance_cpu; /* Earliest time when we have to do rebalance again */ unsigned long next_balance = jiffies + 60*HZ; @@ -8850,6 +10622,23 @@ static void nohz_idle_balance(struct rq *this_rq, enum cpu_idle_type idle) !test_bit(NOHZ_BALANCE_KICK, nohz_flags(this_cpu))) goto end; + /* + * This cpu is going to update the blocked load of idle CPUs either + * before doing a rebalancing or just to keep metrics up to date. we + * can safely update the next update timestamp + */ + rcu_read_lock(); + sd = rcu_dereference(this_rq->sd); + /* + * Check whether there is a sched_domain available for this cpu. + * The last other cpu can have been unplugged since the ILB has been + * triggered and the sched_domain can now be null. The idle balance + * sequence will quickly be aborted as there is no more idle CPUs + */ + if (sd) + nohz.next_update = jiffies + msecs_to_jiffies(LOAD_AVG_PERIOD); + rcu_read_unlock(); + for_each_cpu(balance_cpu, nohz.idle_cpus_mask) { if (balance_cpu == this_cpu || !idle_cpu(balance_cpu)) continue; @@ -8876,7 +10665,15 @@ static void nohz_idle_balance(struct rq *this_rq, enum cpu_idle_type idle) cpu_load_update_idle(rq); rq_unlock_irq(rq, &rf); - rebalance_domains(rq, CPU_IDLE); + update_blocked_averages(balance_cpu); + /* + * This idle load balance softirq may have been + * triggered only to update the blocked load and shares + * of idle CPUs (which we have just done for + * balance_cpu). In that case skip the actual balance. + */ + if (!test_bit(NOHZ_STATS_KICK, nohz_flags(this_cpu))) + rebalance_domains(rq, idle); } if (time_after(next_balance, rq->next_balance)) { @@ -8907,7 +10704,7 @@ static void nohz_idle_balance(struct rq *this_rq, enum cpu_idle_type idle) * - For SD_ASYM_PACKING, if the lower numbered cpu's in the scheduler * domain span are idle. */ -static inline bool nohz_kick_needed(struct rq *rq) +static inline bool nohz_kick_needed(struct rq *rq, bool only_update) { unsigned long now = jiffies; struct sched_domain_shared *sds; @@ -8915,7 +10712,7 @@ static inline bool nohz_kick_needed(struct rq *rq) int nr_busy, i, cpu = rq->cpu; bool kick = false; - if (unlikely(rq->idle_balance)) + if (unlikely(rq->idle_balance) && !only_update) return false; /* @@ -8932,15 +10729,27 @@ static inline bool nohz_kick_needed(struct rq *rq) if (likely(!atomic_read(&nohz.nr_cpus))) return false; + if (only_update) { + if (time_before(now, nohz.next_update)) + return false; + else + return true; + } + if (time_before(now, nohz.next_balance)) return false; - if (rq->nr_running >= 2) + if (rq->nr_running >= 2 && + (!energy_aware() || cpu_overutilized(cpu))) return true; + /* Do idle load balance if there have misfit task */ + if (energy_aware()) + return rq_has_misfit(rq); + rcu_read_lock(); sds = rcu_dereference(per_cpu(sd_llc_shared, cpu)); - if (sds) { + if (sds && !energy_aware()) { /* * XXX: write a coherent comment on why we do this. * See also: http://lkml.kernel.org/r/20111202010832.602203411@sbsiddha-desk.sc.intel.com @@ -8981,6 +10790,7 @@ static inline bool nohz_kick_needed(struct rq *rq) } #else static void nohz_idle_balance(struct rq *this_rq, enum cpu_idle_type idle) { } +static inline bool nohz_kick_needed(struct rq *rq, bool only_update) { return false; } #endif /* @@ -9002,7 +10812,14 @@ static __latent_entropy void run_rebalance_domains(struct softirq_action *h) * and abort nohz_idle_balance altogether if we pull some load. */ nohz_idle_balance(this_rq, idle); + update_blocked_averages(this_rq->cpu); +#ifdef CONFIG_NO_HZ_COMMON + if (!test_bit(NOHZ_STATS_KICK, nohz_flags(this_rq->cpu))) + rebalance_domains(this_rq, idle); + clear_bit(NOHZ_STATS_KICK, nohz_flags(this_rq->cpu)); +#else rebalance_domains(this_rq, idle); +#endif } /* @@ -9017,8 +10834,8 @@ void trigger_load_balance(struct rq *rq) if (time_after_eq(jiffies, rq->next_balance)) raise_softirq(SCHED_SOFTIRQ); #ifdef CONFIG_NO_HZ_COMMON - if (nohz_kick_needed(rq)) - nohz_balancer_kick(); + if (nohz_kick_needed(rq, false)) + nohz_balancer_kick(false); #endif } @@ -9054,6 +10871,10 @@ static void task_tick_fair(struct rq *rq, struct task_struct *curr, int queued) if (static_branch_unlikely(&sched_numa_balancing)) task_tick_numa(rq, curr); + + update_misfit_task(rq, curr); + + update_overutilized_status(rq); } /* @@ -9597,8 +11418,11 @@ __init void init_sched_fair_class(void) #ifdef CONFIG_NO_HZ_COMMON nohz.next_balance = jiffies; + nohz.next_update = jiffies; zalloc_cpumask_var(&nohz.idle_cpus_mask, GFP_NOWAIT); #endif + + alloc_eenv(); #endif /* SMP */ }
diff --git a/kernel/sched/features.h b/kernel/sched/features.h index 9552fd5..306333b 100644 --- a/kernel/sched/features.h +++ b/kernel/sched/features.h
@@ -85,3 +85,31 @@ SCHED_FEAT(ATTACH_AGE_LOAD, true) SCHED_FEAT(WA_IDLE, true) SCHED_FEAT(WA_WEIGHT, true) SCHED_FEAT(WA_BIAS, true) + +/* + * Energy aware scheduling. Use platform energy model to guide scheduling + * decisions optimizing for energy efficiency. + */ +#ifdef CONFIG_DEFAULT_USE_ENERGY_AWARE +SCHED_FEAT(ENERGY_AWARE, true) +#else +SCHED_FEAT(ENERGY_AWARE, false) +#endif + +/* + * Energy aware scheduling algorithm choices: + * EAS_PREFER_IDLE + * Direct tasks in a schedtune.prefer_idle=1 group through + * the EAS path for wakeup task placement. Otherwise, put + * those tasks through the mainline slow path. + * FIND_BEST_TARGET + * Limit the number of placement options for which we calculate + * energy by using heuristics to select 'best idle' and + * 'best active' cpu options. + * FBT_STRICT_ORDER + * ON: If the target CPU saves any energy, use that. + * OFF: Use whichever of target or backup saves most. + */ +SCHED_FEAT(EAS_PREFER_IDLE, true) +SCHED_FEAT(FIND_BEST_TARGET, true) +SCHED_FEAT(FBT_STRICT_ORDER, true)
diff --git a/kernel/sched/idle.c b/kernel/sched/idle.c index 257f4f0..c1dbb2c 100644 --- a/kernel/sched/idle.c +++ b/kernel/sched/idle.c
@@ -25,9 +25,10 @@ extern char __cpuidle_text_start[], __cpuidle_text_end[]; * sched_idle_set_state - Record idle state for the current CPU. * @idle_state: State to record. */ -void sched_idle_set_state(struct cpuidle_state *idle_state) +void sched_idle_set_state(struct cpuidle_state *idle_state, int index) { idle_set_state(this_rq(), idle_state); + idle_set_state_idx(this_rq(), index); } static int __read_mostly cpu_idle_force_poll;
diff --git a/kernel/sched/idle_task.c b/kernel/sched/idle_task.c index d518664..609abdf 100644 --- a/kernel/sched/idle_task.c +++ b/kernel/sched/idle_task.c
@@ -10,7 +10,8 @@ #ifdef CONFIG_SMP static int -select_task_rq_idle(struct task_struct *p, int cpu, int sd_flag, int flags) +select_task_rq_idle(struct task_struct *p, int cpu, int sd_flag, int flags, + int sibling_count_hint) { return task_cpu(p); /* IDLE tasks as never migrated */ }
diff --git a/kernel/sched/rt.c b/kernel/sched/rt.c index 3c96c80..0e56601 100644 --- a/kernel/sched/rt.c +++ b/kernel/sched/rt.c
@@ -9,6 +9,8 @@ #include <linux/slab.h> #include <linux/irq_work.h> +#include "walt.h" + int sched_rr_timeslice = RR_TIMESLICE; int sysctl_sched_rr_timeslice = (MSEC_PER_SEC / HZ) * RR_TIMESLICE; @@ -1335,6 +1337,7 @@ enqueue_task_rt(struct rq *rq, struct task_struct *p, int flags) rt_se->timeout = 0; enqueue_rt_entity(rt_se, flags); + walt_inc_cumulative_runnable_avg(rq, p); if (!task_current(rq, p) && p->nr_cpus_allowed > 1) enqueue_pushable_task(rq, p); @@ -1346,6 +1349,7 @@ static void dequeue_task_rt(struct rq *rq, struct task_struct *p, int flags) update_curr_rt(rq); dequeue_rt_entity(rt_se, flags); + walt_dec_cumulative_runnable_avg(rq, p); dequeue_pushable_task(rq, p); } @@ -1388,7 +1392,8 @@ static void yield_task_rt(struct rq *rq) static int find_lowest_rq(struct task_struct *task); static int -select_task_rq_rt(struct task_struct *p, int cpu, int sd_flag, int flags) +select_task_rq_rt(struct task_struct *p, int cpu, int sd_flag, int flags, + int sibling_count_hint) { struct task_struct *curr; struct rq *rq; @@ -1535,6 +1540,8 @@ static struct task_struct *_pick_next_task_rt(struct rq *rq) return p; } +extern int update_rt_rq_load_avg(u64 now, int cpu, struct rt_rq *rt_rq, int running); + static struct task_struct * pick_next_task_rt(struct rq *rq, struct task_struct *prev, struct rq_flags *rf) { @@ -1580,6 +1587,10 @@ pick_next_task_rt(struct rq *rq, struct task_struct *prev, struct rq_flags *rf) queue_push_tasks(rq); + if (p) + update_rt_rq_load_avg(rq_clock_task(rq), cpu_of(rq), rt_rq, + rq->curr->sched_class == &rt_sched_class); + return p; } @@ -1587,6 +1598,8 @@ static void put_prev_task_rt(struct rq *rq, struct task_struct *p) { update_curr_rt(rq); + update_rt_rq_load_avg(rq_clock_task(rq), cpu_of(rq), &rq->rt, 1); + /* * The previous task needs to be made eligible for pushing * if it is still active @@ -1854,7 +1867,9 @@ static int push_rt_task(struct rq *rq) } deactivate_task(rq, next_task, 0); + next_task->on_rq = TASK_ON_RQ_MIGRATING; set_task_cpu(next_task, lowest_rq->cpu); + next_task->on_rq = TASK_ON_RQ_QUEUED; activate_task(lowest_rq, next_task, 0); ret = 1; @@ -2189,7 +2204,9 @@ static void pull_rt_task(struct rq *this_rq) resched = true; deactivate_task(src_rq, p, 0); + p->on_rq = TASK_ON_RQ_MIGRATING; set_task_cpu(p, this_cpu); + p->on_rq = TASK_ON_RQ_QUEUED; activate_task(this_rq, p, 0); /* * We continue with the search, just in @@ -2369,6 +2386,7 @@ static void task_tick_rt(struct rq *rq, struct task_struct *p, int queued) struct sched_rt_entity *rt_se = &p->rt; update_curr_rt(rq); + update_rt_rq_load_avg(rq_clock_task(rq), cpu_of(rq), &rq->rt, 1); watchdog(rq, p);
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h index 3b448ba..d767020 100644 --- a/kernel/sched/sched.h +++ b/kernel/sched/sched.h
@@ -483,6 +483,10 @@ struct cfs_rq { struct list_head leaf_cfs_rq_list; struct task_group *tg; /* group that "owns" this runqueue */ +#ifdef CONFIG_SCHED_WALT + u64 cumulative_runnable_avg; +#endif + #ifdef CONFIG_CFS_BANDWIDTH int runtime_enabled; u64 runtime_expires; @@ -524,6 +528,9 @@ struct rt_rq { unsigned long rt_nr_total; int overloaded; struct plist_head pushable_tasks; + + struct sched_avg avg; + #ifdef HAVE_RT_PUSH_IPI int push_flags; int push_cpu; @@ -646,6 +653,9 @@ struct root_domain { struct cpupri cpupri; unsigned long max_cpu_capacity; + + /* First cpu with maximum and minimum original capacity */ + int max_cap_orig_cpu, min_cap_orig_cpu; }; extern struct root_domain def_root_domain; @@ -682,6 +692,7 @@ struct rq { #ifdef CONFIG_NO_HZ_COMMON #ifdef CONFIG_SMP unsigned long last_load_update_tick; + unsigned long last_blocked_load_update_tick; #endif /* CONFIG_SMP */ unsigned long nohz_flags; #endif /* CONFIG_NO_HZ_COMMON */ @@ -731,6 +742,8 @@ struct rq { struct callback_head *balance_callback; unsigned char idle_balance; + + unsigned int misfit_task; /* For active balancing */ int active_balance; int push_cpu; @@ -750,6 +763,20 @@ struct rq { u64 max_idle_balance_cost; #endif +#ifdef CONFIG_SCHED_WALT + u64 cumulative_runnable_avg; + u64 window_start; + u64 curr_runnable_sum; + u64 prev_runnable_sum; + u64 nt_curr_runnable_sum; + u64 nt_prev_runnable_sum; + u64 cur_irqload; + u64 avg_irqload; + u64 irqload_ts; + u64 cum_window_demand; +#endif /* CONFIG_SCHED_WALT */ + + #ifdef CONFIG_IRQ_TIME_ACCOUNTING u64 prev_irq_time; #endif @@ -797,6 +824,7 @@ struct rq { #ifdef CONFIG_CPU_IDLE /* Must be inspected within a rcu lock section */ struct cpuidle_state *idle_state; + int idle_state_idx; #endif }; @@ -1055,6 +1083,8 @@ DECLARE_PER_CPU(int, sd_llc_id); DECLARE_PER_CPU(struct sched_domain_shared *, sd_llc_shared); DECLARE_PER_CPU(struct sched_domain *, sd_numa); DECLARE_PER_CPU(struct sched_domain *, sd_asym); +DECLARE_PER_CPU(struct sched_domain *, sd_ea); +DECLARE_PER_CPU(struct sched_domain *, sd_scs); struct sched_group_capacity { atomic_t ref; @@ -1064,6 +1094,7 @@ struct sched_group_capacity { */ unsigned long capacity; unsigned long min_capacity; /* Min per-CPU capacity in group */ + unsigned long max_capacity; /* Max per-CPU capacity in group */ unsigned long next_update; int imbalance; /* XXX unrelated to capacity but shared group state */ @@ -1081,6 +1112,7 @@ struct sched_group { unsigned int group_weight; struct sched_group_capacity *sgc; int asym_prefer_cpu; /* cpu of highest priority in group */ + const struct sched_group_energy const *sge; /* * The CPUs this group covers. @@ -1421,7 +1453,8 @@ struct sched_class { void (*put_prev_task) (struct rq *rq, struct task_struct *p); #ifdef CONFIG_SMP - int (*select_task_rq)(struct task_struct *p, int task_cpu, int sd_flag, int flags); + int (*select_task_rq)(struct task_struct *p, int task_cpu, int sd_flag, int flags, + int subling_count_hint); void (*migrate_task_rq)(struct task_struct *p); void (*task_woken) (struct rq *this_rq, struct task_struct *task); @@ -1508,6 +1541,17 @@ static inline struct cpuidle_state *idle_get_state(struct rq *rq) SCHED_WARN_ON(!rcu_read_lock_held()); return rq->idle_state; } + +static inline void idle_set_state_idx(struct rq *rq, int idle_state_idx) +{ + rq->idle_state_idx = idle_state_idx; +} + +static inline int idle_get_state_idx(struct rq *rq) +{ + WARN_ON(!rcu_read_lock_held()); + return rq->idle_state_idx; +} #else static inline void idle_set_state(struct rq *rq, struct cpuidle_state *idle_state) @@ -1518,6 +1562,15 @@ static inline struct cpuidle_state *idle_get_state(struct rq *rq) { return NULL; } + +static inline void idle_set_state_idx(struct rq *rq, int idle_state_idx) +{ +} + +static inline int idle_get_state_idx(struct rq *rq) +{ + return -1; +} #endif extern void schedule_idle(void); @@ -1654,6 +1707,7 @@ static inline int hrtick_enabled(struct rq *rq) #ifdef CONFIG_SMP extern void sched_avg_update(struct rq *rq); +extern unsigned long sched_get_rt_rq_util(int cpu); #ifndef arch_scale_freq_capacity static __always_inline @@ -1674,6 +1728,86 @@ unsigned long arch_scale_cpu_capacity(struct sched_domain *sd, int cpu) } #endif +#ifdef CONFIG_SMP +static inline unsigned long capacity_of(int cpu) +{ + return cpu_rq(cpu)->cpu_capacity; +} + +static inline unsigned long capacity_orig_of(int cpu) +{ + return cpu_rq(cpu)->cpu_capacity_orig; +} + +extern unsigned int sysctl_sched_use_walt_cpu_util; +extern unsigned int walt_ravg_window; +extern bool walt_disabled; + +/* + * cpu_util returns the amount of capacity of a CPU that is used by CFS + * tasks. The unit of the return value must be the one of capacity so we can + * compare the utilization with the capacity of the CPU that is available for + * CFS task (ie cpu_capacity). + * + * cfs_rq.avg.util_avg is the sum of running time of runnable tasks plus the + * recent utilization of currently non-runnable tasks on a CPU. It represents + * the amount of utilization of a CPU in the range [0..capacity_orig] where + * capacity_orig is the cpu_capacity available at the highest frequency + * (arch_scale_freq_capacity()). + * The utilization of a CPU converges towards a sum equal to or less than the + * current capacity (capacity_curr <= capacity_orig) of the CPU because it is + * the running time on this CPU scaled by capacity_curr. + * + * Nevertheless, cfs_rq.avg.util_avg can be higher than capacity_curr or even + * higher than capacity_orig because of unfortunate rounding in + * cfs.avg.util_avg or just after migrating tasks and new task wakeups until + * the average stabilizes with the new running time. We need to check that the + * utilization stays within the range of [0..capacity_orig] and cap it if + * necessary. Without utilization capping, a group could be seen as overloaded + * (CPU0 utilization at 121% + CPU1 utilization at 80%) whereas CPU1 has 20% of + * available capacity. We allow utilization to overshoot capacity_curr (but not + * capacity_orig) as it useful for predicting the capacity required after task + * migrations (scheduler-driven DVFS). + */ +static inline unsigned long __cpu_util(int cpu, int delta) +{ + unsigned long util = cpu_rq(cpu)->cfs.avg.util_avg; + unsigned long capacity = capacity_orig_of(cpu); + +#ifdef CONFIG_SCHED_WALT + if (!walt_disabled && sysctl_sched_use_walt_cpu_util) { + util = cpu_rq(cpu)->cumulative_runnable_avg << SCHED_CAPACITY_SHIFT; + util = div_u64(util, walt_ravg_window); + } +#endif + delta += util; + if (delta < 0) + return 0; + + return (delta >= capacity) ? capacity : delta; +} + +static inline unsigned long cpu_util(int cpu) +{ + return __cpu_util(cpu, 0); +} + +static inline unsigned long cpu_util_freq(int cpu) +{ + unsigned long util = cpu_rq(cpu)->cfs.avg.util_avg; + unsigned long capacity = capacity_orig_of(cpu); + +#ifdef CONFIG_SCHED_WALT + if (!walt_disabled && sysctl_sched_use_walt_cpu_util) { + util = cpu_rq(cpu)->prev_runnable_sum << SCHED_CAPACITY_SHIFT; + do_div(util, walt_ravg_window); + } +#endif + return (util >= capacity) ? capacity : util; +} + +#endif + static inline void sched_rt_avg_update(struct rq *rq, u64 rt_delta) { rq->rt_avg += rt_delta * arch_scale_freq_capacity(NULL, cpu_of(rq)); @@ -1979,6 +2113,7 @@ extern void cfs_bandwidth_usage_dec(void); enum rq_nohz_flag_bits { NOHZ_TICK_STOPPED, NOHZ_BALANCE_KICK, + NOHZ_STATS_KICK }; #define nohz_flags(cpu) (&cpu_rq(cpu)->nohz_flags) @@ -1990,6 +2125,9 @@ static inline void nohz_balance_exit_idle(unsigned int cpu) { } #ifdef CONFIG_SMP + +extern void init_energy_aware_data(int cpu); + static inline void __dl_update(struct dl_bw *dl_b, s64 bw) { @@ -2083,6 +2221,17 @@ static inline void cpufreq_update_util(struct rq *rq, unsigned int flags) static inline void cpufreq_update_util(struct rq *rq, unsigned int flags) {} #endif /* CONFIG_CPU_FREQ */ +#ifdef CONFIG_SCHED_WALT + +static inline bool +walt_task_in_cum_window_demand(struct rq *rq, struct task_struct *p) +{ + return cpu_of(rq) == task_cpu(p) && + (p->on_rq || p->last_sleep_ts >= rq->window_start); +} + +#endif /* CONFIG_SCHED_WALT */ + #ifdef arch_scale_freq_capacity #ifndef arch_scale_freq_invariant #define arch_scale_freq_invariant() (true)
diff --git a/kernel/sched/stop_task.c b/kernel/sched/stop_task.c index 45caf90..7ca03e5 100644 --- a/kernel/sched/stop_task.c +++ b/kernel/sched/stop_task.c
@@ -1,5 +1,6 @@ // SPDX-License-Identifier: GPL-2.0 #include "sched.h" +#include "walt.h" /* * stop-task scheduling class. @@ -12,7 +13,8 @@ #ifdef CONFIG_SMP static int -select_task_rq_stop(struct task_struct *p, int cpu, int sd_flag, int flags) +select_task_rq_stop(struct task_struct *p, int cpu, int sd_flag, int flags, + int sibling_count_hint) { return task_cpu(p); /* stop tasks as never migrate */ } @@ -43,12 +45,14 @@ static void enqueue_task_stop(struct rq *rq, struct task_struct *p, int flags) { add_nr_running(rq, 1); + walt_inc_cumulative_runnable_avg(rq, p); } static void dequeue_task_stop(struct rq *rq, struct task_struct *p, int flags) { sub_nr_running(rq, 1); + walt_dec_cumulative_runnable_avg(rq, p); } static void yield_task_stop(struct rq *rq)
diff --git a/kernel/sched/topology.c b/kernel/sched/topology.c index 6798276..f2f36d8 100644 --- a/kernel/sched/topology.c +++ b/kernel/sched/topology.c
@@ -39,9 +39,6 @@ static int sched_domain_debug_one(struct sched_domain *sd, int cpu, int level, if (!(sd->flags & SD_LOAD_BALANCE)) { printk("does not load-balance\n"); - if (sd->parent) - printk(KERN_ERR "ERROR: !SD_LOAD_BALANCE domain" - " has parent"); return -1; } @@ -154,8 +151,12 @@ static inline bool sched_debug(void) static int sd_degenerate(struct sched_domain *sd) { - if (cpumask_weight(sched_domain_span(sd)) == 1) - return 1; + if (cpumask_weight(sched_domain_span(sd)) == 1) { + if (sd->groups->sge) + sd->flags &= ~SD_LOAD_BALANCE; + else + return 1; + } /* Following flags need at least 2 groups */ if (sd->flags & (SD_LOAD_BALANCE | @@ -165,7 +166,8 @@ static int sd_degenerate(struct sched_domain *sd) SD_SHARE_CPUCAPACITY | SD_ASYM_CPUCAPACITY | SD_SHARE_PKG_RESOURCES | - SD_SHARE_POWERDOMAIN)) { + SD_SHARE_POWERDOMAIN | + SD_SHARE_CAP_STATES)) { if (sd->groups != sd->groups->next) return 0; } @@ -198,7 +200,12 @@ sd_parent_degenerate(struct sched_domain *sd, struct sched_domain *parent) SD_SHARE_CPUCAPACITY | SD_SHARE_PKG_RESOURCES | SD_PREFER_SIBLING | - SD_SHARE_POWERDOMAIN); + SD_SHARE_POWERDOMAIN | + SD_SHARE_CAP_STATES); + if (parent->groups->sge) { + parent->flags &= ~SD_LOAD_BALANCE; + return 0; + } if (nr_node_ids == 1) pflags &= ~SD_SERIALIZE; } @@ -275,6 +282,9 @@ static int init_rootdomain(struct root_domain *rd) if (cpupri_init(&rd->cpupri) != 0) goto free_cpudl; + + rd->max_cap_orig_cpu = rd->min_cap_orig_cpu = -1; + return 0; free_cpudl: @@ -386,11 +396,14 @@ DEFINE_PER_CPU(int, sd_llc_id); DEFINE_PER_CPU(struct sched_domain_shared *, sd_llc_shared); DEFINE_PER_CPU(struct sched_domain *, sd_numa); DEFINE_PER_CPU(struct sched_domain *, sd_asym); +DEFINE_PER_CPU(struct sched_domain *, sd_ea); +DEFINE_PER_CPU(struct sched_domain *, sd_scs); static void update_top_cache_domain(int cpu) { struct sched_domain_shared *sds = NULL; struct sched_domain *sd; + struct sched_domain *ea_sd = NULL; int id = cpu; int size = 1; @@ -411,6 +424,17 @@ static void update_top_cache_domain(int cpu) sd = highest_flag_domain(cpu, SD_ASYM_PACKING); rcu_assign_pointer(per_cpu(sd_asym, cpu), sd); + + for_each_domain(cpu, sd) { + if (sd->groups->sge) + ea_sd = sd; + else + break; + } + rcu_assign_pointer(per_cpu(sd_ea, cpu), ea_sd); + + sd = highest_flag_domain(cpu, SD_SHARE_CAP_STATES); + rcu_assign_pointer(per_cpu(sd_scs, cpu), sd); } /* @@ -695,6 +719,7 @@ static void init_overlap_sched_group(struct sched_domain *sd, sg_span = sched_group_span(sg); sg->sgc->capacity = SCHED_CAPACITY_SCALE * cpumask_weight(sg_span); sg->sgc->min_capacity = SCHED_CAPACITY_SCALE; + sg->sgc->max_capacity = SCHED_CAPACITY_SCALE; } static int @@ -943,6 +968,108 @@ static void init_sched_groups_capacity(int cpu, struct sched_domain *sd) update_group_capacity(sd, cpu); } +#define cap_state_power(s,i) (s->cap_states[i].power) +#define cap_state_cap(s,i) (s->cap_states[i].cap) +#define idle_state_power(s,i) (s->idle_states[i].power) + +static inline int sched_group_energy_equal(const struct sched_group_energy *a, + const struct sched_group_energy *b) +{ + int i; + + /* check pointers first */ + if (a == b) + return true; + + /* check contents are equivalent */ + if (a->nr_cap_states != b->nr_cap_states) + return false; + if (a->nr_idle_states != b->nr_idle_states) + return false; + for (i=0;i<a->nr_cap_states;i++){ + if (cap_state_power(a,i) != + cap_state_power(b,i)) + return false; + if (cap_state_cap(a,i) != + cap_state_cap(b,i)) + return false; + } + for (i=0;i<a->nr_idle_states;i++){ + if (idle_state_power(a,i) != + idle_state_power(b,i)) + return false; + } + + return true; +} + +#define energy_eff(e, n) \ + ((e->cap_states[n].cap << SCHED_CAPACITY_SHIFT)/e->cap_states[n].power) + +static void init_sched_groups_energy(int cpu, struct sched_domain *sd, + sched_domain_energy_f fn) +{ + struct sched_group *sg = sd->groups; + const struct sched_group_energy *sge; + int i; + + if (!(fn && fn(cpu))) + return; + + if (cpu != group_balance_cpu(sg)) + return; + + if (sd->flags & SD_OVERLAP) { + pr_err("BUG: EAS does not support overlapping sd spans\n"); +#ifdef CONFIG_SCHED_DEBUG + pr_err(" the %s domain has SD_OVERLAP set\n", sd->name); +#endif + return; + } + + if (sd->child && !sd->child->groups->sge) { + pr_err("BUG: EAS setup borken for CPU%d\n", cpu); +#ifdef CONFIG_SCHED_DEBUG + pr_err(" energy data on %s but not on %s domain\n", + sd->name, sd->child->name); +#endif + return; + } + + sge = fn(cpu); + + /* + * Check that the per-cpu provided sd energy data is consistent for all + * cpus within the mask. + */ + if (cpumask_weight(sched_group_span(sg)) > 1) { + struct cpumask mask; + + cpumask_xor(&mask, sched_group_span(sg), get_cpu_mask(cpu)); + + for_each_cpu(i, &mask) + BUG_ON(!sched_group_energy_equal(sge,fn(i))); + } + + /* Check that energy efficiency (capacity/power) is monotonically + * decreasing in the capacity state vector with higher indexes + */ + for (i = 0; i < (sge->nr_cap_states - 1); i++) { + if (energy_eff(sge, i) > energy_eff(sge, i+1)) + continue; +#ifdef CONFIG_SCHED_DEBUG + pr_warn("WARN: cpu=%d, domain=%s: incr. energy eff %lu[%d]->%lu[%d]\n", + cpu, sd->name, energy_eff(sge, i), i, + energy_eff(sge, i+1), i+1); +#else + pr_warn("WARN: cpu=%d: incr. energy eff %lu[%d]->%lu[%d]\n", + cpu, energy_eff(sge, i), i, energy_eff(sge, i+1), i+1); +#endif + } + + sd->groups->sge = fn(cpu); +} + /* * Initializers for schedule domains * Non-inlined to reduce accumulated stack pressure in build_sched_domains() @@ -1062,6 +1189,7 @@ static int sched_domains_curr_level; * SD_NUMA - describes NUMA topologies * SD_SHARE_POWERDOMAIN - describes shared power domain * SD_ASYM_CPUCAPACITY - describes mixed capacity topologies + * SD_SHARE_CAP_STATES - describes shared capacity states * * Odd one out, which beside describing the topology has a quirk also * prescribes the desired behaviour that goes along with it: @@ -1074,7 +1202,8 @@ static int sched_domains_curr_level; SD_NUMA | \ SD_ASYM_PACKING | \ SD_ASYM_CPUCAPACITY | \ - SD_SHARE_POWERDOMAIN) + SD_SHARE_POWERDOMAIN | \ + SD_SHARE_CAP_STATES) static struct sched_domain * sd_init(struct sched_domain_topology_level *tl, @@ -1183,15 +1312,11 @@ sd_init(struct sched_domain_topology_level *tl, sd->idle_idx = 1; } - /* - * For all levels sharing cache; connect a sched_domain_shared - * instance. - */ - if (sd->flags & SD_SHARE_PKG_RESOURCES) { - sd->shared = *per_cpu_ptr(sdd->sds, sd_id); - atomic_inc(&sd->shared->ref); + sd->shared = *per_cpu_ptr(sdd->sds, sd_id); + atomic_inc(&sd->shared->ref); + + if (sd->flags & SD_SHARE_PKG_RESOURCES) atomic_set(&sd->shared->nr_busy_cpus, sd_weight); - } sd->private = sdd; @@ -1650,8 +1775,6 @@ build_sched_domains(const struct cpumask *cpu_map, struct sched_domain_attr *att *per_cpu_ptr(d.sd, i) = sd; if (tl->flags & SDTL_OVERLAP) sd->flags |= SD_OVERLAP; - if (cpumask_equal(cpu_map, sched_domain_span(sd))) - break; } } @@ -1671,10 +1794,13 @@ build_sched_domains(const struct cpumask *cpu_map, struct sched_domain_attr *att /* Calculate CPU capacity for physical packages and nodes */ for (i = nr_cpumask_bits-1; i >= 0; i--) { + struct sched_domain_topology_level *tl = sched_domain_topology; + if (!cpumask_test_cpu(i, cpu_map)) continue; - for (sd = *per_cpu_ptr(d.sd, i); sd; sd = sd->parent) { + for (sd = *per_cpu_ptr(d.sd, i); sd; sd = sd->parent, tl++) { + init_sched_groups_energy(i, sd, tl->energy); claim_allocations(i, sd); init_sched_groups_capacity(i, sd); } @@ -1683,6 +1809,9 @@ build_sched_domains(const struct cpumask *cpu_map, struct sched_domain_attr *att /* Attach the domains */ rcu_read_lock(); for_each_cpu(i, cpu_map) { + int max_cpu = READ_ONCE(d.rd->max_cap_orig_cpu); + int min_cpu = READ_ONCE(d.rd->min_cap_orig_cpu); + rq = cpu_rq(i); sd = *per_cpu_ptr(d.sd, i); @@ -1690,6 +1819,14 @@ build_sched_domains(const struct cpumask *cpu_map, struct sched_domain_attr *att if (rq->cpu_capacity_orig > READ_ONCE(d.rd->max_cpu_capacity)) WRITE_ONCE(d.rd->max_cpu_capacity, rq->cpu_capacity_orig); + if ((max_cpu < 0) || (cpu_rq(i)->cpu_capacity_orig > + cpu_rq(max_cpu)->cpu_capacity_orig)) + WRITE_ONCE(d.rd->max_cap_orig_cpu, i); + + if ((min_cpu < 0) || (cpu_rq(i)->cpu_capacity_orig < + cpu_rq(min_cpu)->cpu_capacity_orig)) + WRITE_ONCE(d.rd->min_cap_orig_cpu, i); + cpu_attach_domain(sd, d.rd, i); } rcu_read_unlock();
diff --git a/kernel/sched/tune.c b/kernel/sched/tune.c new file mode 100644 index 0000000..2e6ef5f --- /dev/null +++ b/kernel/sched/tune.c
@@ -0,0 +1,559 @@ +#include <linux/cgroup.h> +#include <linux/err.h> +#include <linux/kernel.h> +#include <linux/percpu.h> +#include <linux/printk.h> +#include <linux/rcupdate.h> +#include <linux/slab.h> + +#include <trace/events/sched.h> + +#include "sched.h" +#include "tune.h" + +bool schedtune_initialized = false; +extern struct reciprocal_value schedtune_spc_rdiv; + +/* + * EAS scheduler tunables for task groups. + */ + +/* SchdTune tunables for a group of tasks */ +struct schedtune { + /* SchedTune CGroup subsystem */ + struct cgroup_subsys_state css; + + /* Boost group allocated ID */ + int idx; + + /* Boost value for tasks on that SchedTune CGroup */ + int boost; + + /* Hint to bias scheduling of tasks on that SchedTune CGroup + * towards idle CPUs */ + int prefer_idle; +}; + +static inline struct schedtune *css_st(struct cgroup_subsys_state *css) +{ + return css ? container_of(css, struct schedtune, css) : NULL; +} + +static inline struct schedtune *task_schedtune(struct task_struct *tsk) +{ + return css_st(task_css(tsk, schedtune_cgrp_id)); +} + +static inline struct schedtune *parent_st(struct schedtune *st) +{ + return css_st(st->css.parent); +} + +/* + * SchedTune root control group + * The root control group is used to defined a system-wide boosting tuning, + * which is applied to all tasks in the system. + * Task specific boost tuning could be specified by creating and + * configuring a child control group under the root one. + * By default, system-wide boosting is disabled, i.e. no boosting is applied + * to tasks which are not into a child control group. + */ +static struct schedtune +root_schedtune = { + .boost = 0, + .prefer_idle = 0, +}; + +/* + * Maximum number of boost groups to support + * When per-task boosting is used we still allow only limited number of + * boost groups for two main reasons: + * 1. on a real system we usually have only few classes of workloads which + * make sense to boost with different values (e.g. background vs foreground + * tasks, interactive vs low-priority tasks) + * 2. a limited number allows for a simpler and more memory/time efficient + * implementation especially for the computation of the per-CPU boost + * value + */ +#define BOOSTGROUPS_COUNT 5 + +/* Array of configured boostgroups */ +static struct schedtune *allocated_group[BOOSTGROUPS_COUNT] = { + &root_schedtune, + NULL, +}; + +/* SchedTune boost groups + * Keep track of all the boost groups which impact on CPU, for example when a + * CPU has two RUNNABLE tasks belonging to two different boost groups and thus + * likely with different boost values. + * Since on each system we expect only a limited number of boost groups, here + * we use a simple array to keep track of the metrics required to compute the + * maximum per-CPU boosting value. + */ +struct boost_groups { + /* Maximum boost value for all RUNNABLE tasks on a CPU */ + bool idle; + int boost_max; + struct { + /* The boost for tasks on that boost group */ + int boost; + /* Count of RUNNABLE tasks on that boost group */ + unsigned tasks; + } group[BOOSTGROUPS_COUNT]; + /* CPU's boost group locking */ + raw_spinlock_t lock; +}; + +/* Boost groups affecting each CPU in the system */ +DEFINE_PER_CPU(struct boost_groups, cpu_boost_groups); + +static void +schedtune_cpu_update(int cpu) +{ + struct boost_groups *bg = &per_cpu(cpu_boost_groups, cpu); + int boost_max; + int idx; + + /* The root boost group is always active */ + boost_max = bg->group[0].boost; + for (idx = 1; idx < BOOSTGROUPS_COUNT; ++idx) { + /* + * A boost group affects a CPU only if it has + * RUNNABLE tasks on that CPU + */ + if (bg->group[idx].tasks == 0) + continue; + + boost_max = max(boost_max, bg->group[idx].boost); + } + /* Ensures boost_max is non-negative when all cgroup boost values + * are neagtive. Avoids under-accounting of cpu capacity which may cause + * task stacking and frequency spikes.*/ + boost_max = max(boost_max, 0); + bg->boost_max = boost_max; +} + +static int +schedtune_boostgroup_update(int idx, int boost) +{ + struct boost_groups *bg; + int cur_boost_max; + int old_boost; + int cpu; + + /* Update per CPU boost groups */ + for_each_possible_cpu(cpu) { + bg = &per_cpu(cpu_boost_groups, cpu); + + /* + * Keep track of current boost values to compute the per CPU + * maximum only when it has been affected by the new value of + * the updated boost group + */ + cur_boost_max = bg->boost_max; + old_boost = bg->group[idx].boost; + + /* Update the boost value of this boost group */ + bg->group[idx].boost = boost; + + /* Check if this update increase current max */ + if (boost > cur_boost_max && bg->group[idx].tasks) { + bg->boost_max = boost; + trace_sched_tune_boostgroup_update(cpu, 1, bg->boost_max); + continue; + } + + /* Check if this update has decreased current max */ + if (cur_boost_max == old_boost && old_boost > boost) { + schedtune_cpu_update(cpu); + trace_sched_tune_boostgroup_update(cpu, -1, bg->boost_max); + continue; + } + + trace_sched_tune_boostgroup_update(cpu, 0, bg->boost_max); + } + + return 0; +} + +#define ENQUEUE_TASK 1 +#define DEQUEUE_TASK -1 + +static inline void +schedtune_tasks_update(struct task_struct *p, int cpu, int idx, int task_count) +{ + struct boost_groups *bg = &per_cpu(cpu_boost_groups, cpu); + int tasks = bg->group[idx].tasks + task_count; + + /* Update boosted tasks count while avoiding to make it negative */ + bg->group[idx].tasks = max(0, tasks); + + trace_sched_tune_tasks_update(p, cpu, tasks, idx, + bg->group[idx].boost, bg->boost_max); + + /* Boost group activation or deactivation on that RQ */ + if (tasks == 1 || tasks == 0) + schedtune_cpu_update(cpu); +} + +/* + * NOTE: This function must be called while holding the lock on the CPU RQ + */ +void schedtune_enqueue_task(struct task_struct *p, int cpu) +{ + struct boost_groups *bg = &per_cpu(cpu_boost_groups, cpu); + unsigned long irq_flags; + struct schedtune *st; + int idx; + + if (unlikely(!schedtune_initialized)) + return; + + /* + * Boost group accouting is protected by a per-cpu lock and requires + * interrupt to be disabled to avoid race conditions for example on + * do_exit()::cgroup_exit() and task migration. + */ + raw_spin_lock_irqsave(&bg->lock, irq_flags); + rcu_read_lock(); + + st = task_schedtune(p); + idx = st->idx; + + schedtune_tasks_update(p, cpu, idx, ENQUEUE_TASK); + + rcu_read_unlock(); + raw_spin_unlock_irqrestore(&bg->lock, irq_flags); +} + +int schedtune_can_attach(struct cgroup_taskset *tset) +{ + struct task_struct *task; + struct cgroup_subsys_state *css; + struct boost_groups *bg; + struct rq_flags rq_flags; + unsigned int cpu; + struct rq *rq; + int src_bg; /* Source boost group index */ + int dst_bg; /* Destination boost group index */ + int tasks; + + if (unlikely(!schedtune_initialized)) + return 0; + + + cgroup_taskset_for_each(task, css, tset) { + + /* + * Lock the CPU's RQ the task is enqueued to avoid race + * conditions with migration code while the task is being + * accounted + */ + rq = task_rq_lock(task, &rq_flags); + + if (!task->on_rq) { + task_rq_unlock(rq, task, &rq_flags); + continue; + } + + /* + * Boost group accouting is protected by a per-cpu lock and requires + * interrupt to be disabled to avoid race conditions on... + */ + cpu = cpu_of(rq); + bg = &per_cpu(cpu_boost_groups, cpu); + raw_spin_lock(&bg->lock); + + dst_bg = css_st(css)->idx; + src_bg = task_schedtune(task)->idx; + + /* + * Current task is not changing boostgroup, which can + * happen when the new hierarchy is in use. + */ + if (unlikely(dst_bg == src_bg)) { + raw_spin_unlock(&bg->lock); + task_rq_unlock(rq, task, &rq_flags); + continue; + } + + /* + * This is the case of a RUNNABLE task which is switching its + * current boost group. + */ + + /* Move task from src to dst boost group */ + tasks = bg->group[src_bg].tasks - 1; + bg->group[src_bg].tasks = max(0, tasks); + bg->group[dst_bg].tasks += 1; + + raw_spin_unlock(&bg->lock); + task_rq_unlock(rq, task, &rq_flags); + + /* Update CPU boost group */ + if (bg->group[src_bg].tasks == 0 || bg->group[dst_bg].tasks == 1) + schedtune_cpu_update(task_cpu(task)); + + } + + return 0; +} + +void schedtune_cancel_attach(struct cgroup_taskset *tset) +{ + /* This can happen only if SchedTune controller is mounted with + * other hierarchies ane one of them fails. Since usually SchedTune is + * mouted on its own hierarcy, for the time being we do not implement + * a proper rollback mechanism */ + WARN(1, "SchedTune cancel attach not implemented"); +} + +/* + * NOTE: This function must be called while holding the lock on the CPU RQ + */ +void schedtune_dequeue_task(struct task_struct *p, int cpu) +{ + struct boost_groups *bg = &per_cpu(cpu_boost_groups, cpu); + unsigned long irq_flags; + struct schedtune *st; + int idx; + + if (unlikely(!schedtune_initialized)) + return; + + /* + * Boost group accouting is protected by a per-cpu lock and requires + * interrupt to be disabled to avoid race conditions on... + */ + raw_spin_lock_irqsave(&bg->lock, irq_flags); + rcu_read_lock(); + + st = task_schedtune(p); + idx = st->idx; + + schedtune_tasks_update(p, cpu, idx, DEQUEUE_TASK); + + rcu_read_unlock(); + raw_spin_unlock_irqrestore(&bg->lock, irq_flags); +} + +int schedtune_cpu_boost(int cpu) +{ + struct boost_groups *bg; + + bg = &per_cpu(cpu_boost_groups, cpu); + return bg->boost_max; +} + +int schedtune_task_boost(struct task_struct *p) +{ + struct schedtune *st; + int task_boost; + + if (unlikely(!schedtune_initialized)) + return 0; + + /* Get task boost value */ + rcu_read_lock(); + st = task_schedtune(p); + task_boost = st->boost; + rcu_read_unlock(); + + return task_boost; +} + +int schedtune_prefer_idle(struct task_struct *p) +{ + struct schedtune *st; + int prefer_idle; + + if (unlikely(!schedtune_initialized)) + return 0; + + /* Get prefer_idle value */ + rcu_read_lock(); + st = task_schedtune(p); + prefer_idle = st->prefer_idle; + rcu_read_unlock(); + + return prefer_idle; +} + +static u64 +prefer_idle_read(struct cgroup_subsys_state *css, struct cftype *cft) +{ + struct schedtune *st = css_st(css); + + return st->prefer_idle; +} + +static int +prefer_idle_write(struct cgroup_subsys_state *css, struct cftype *cft, + u64 prefer_idle) +{ + struct schedtune *st = css_st(css); + st->prefer_idle = prefer_idle; + + return 0; +} + +static s64 +boost_read(struct cgroup_subsys_state *css, struct cftype *cft) +{ + struct schedtune *st = css_st(css); + + return st->boost; +} + +static int +boost_write(struct cgroup_subsys_state *css, struct cftype *cft, + s64 boost) +{ + struct schedtune *st = css_st(css); + + if (boost < 0 || boost > 100) + return -EINVAL; + + st->boost = boost; + + /* Update CPU boost */ + schedtune_boostgroup_update(st->idx, st->boost); + + return 0; +} + +static struct cftype files[] = { + { + .name = "boost", + .read_s64 = boost_read, + .write_s64 = boost_write, + }, + { + .name = "prefer_idle", + .read_u64 = prefer_idle_read, + .write_u64 = prefer_idle_write, + }, + { } /* terminate */ +}; + +static int +schedtune_boostgroup_init(struct schedtune *st) +{ + struct boost_groups *bg; + int cpu; + + /* Keep track of allocated boost groups */ + allocated_group[st->idx] = st; + + /* Initialize the per CPU boost groups */ + for_each_possible_cpu(cpu) { + bg = &per_cpu(cpu_boost_groups, cpu); + bg->group[st->idx].boost = 0; + bg->group[st->idx].tasks = 0; + raw_spin_lock_init(&bg->lock); + } + + return 0; +} + +static struct cgroup_subsys_state * +schedtune_css_alloc(struct cgroup_subsys_state *parent_css) +{ + struct schedtune *st; + int idx; + + if (!parent_css) + return &root_schedtune.css; + + /* Allow only single level hierachies */ + if (parent_css != &root_schedtune.css) { + pr_err("Nested SchedTune boosting groups not allowed\n"); + return ERR_PTR(-ENOMEM); + } + + /* Allow only a limited number of boosting groups */ + for (idx = 1; idx < BOOSTGROUPS_COUNT; ++idx) + if (!allocated_group[idx]) + break; + if (idx == BOOSTGROUPS_COUNT) { + pr_err("Trying to create more than %d SchedTune boosting groups\n", + BOOSTGROUPS_COUNT); + return ERR_PTR(-ENOSPC); + } + + st = kzalloc(sizeof(*st), GFP_KERNEL); + if (!st) + goto out; + + /* Initialize per CPUs boost group support */ + st->idx = idx; + if (schedtune_boostgroup_init(st)) + goto release; + + return &st->css; + +release: + kfree(st); +out: + return ERR_PTR(-ENOMEM); +} + +static void +schedtune_boostgroup_release(struct schedtune *st) +{ + /* Reset this boost group */ + schedtune_boostgroup_update(st->idx, 0); + + /* Keep track of allocated boost groups */ + allocated_group[st->idx] = NULL; +} + +static void +schedtune_css_free(struct cgroup_subsys_state *css) +{ + struct schedtune *st = css_st(css); + + schedtune_boostgroup_release(st); + kfree(st); +} + +struct cgroup_subsys schedtune_cgrp_subsys = { + .css_alloc = schedtune_css_alloc, + .css_free = schedtune_css_free, + .can_attach = schedtune_can_attach, + .cancel_attach = schedtune_cancel_attach, + .legacy_cftypes = files, + .early_init = 1, +}; + +static inline void +schedtune_init_cgroups(void) +{ + struct boost_groups *bg; + int cpu; + + /* Initialize the per CPU boost groups */ + for_each_possible_cpu(cpu) { + bg = &per_cpu(cpu_boost_groups, cpu); + memset(bg, 0, sizeof(struct boost_groups)); + raw_spin_lock_init(&bg->lock); + } + + pr_info("schedtune: configured to support %d boost groups\n", + BOOSTGROUPS_COUNT); + + schedtune_initialized = true; +} + +/* + * Initialize the cgroup structures + */ +static int +schedtune_init(void) +{ + schedtune_spc_rdiv = reciprocal_value(100); + schedtune_init_cgroups(); + return 0; +} +postcore_initcall(schedtune_init);
diff --git a/kernel/sched/tune.h b/kernel/sched/tune.h new file mode 100644 index 0000000..e79e1b1 --- /dev/null +++ b/kernel/sched/tune.h
@@ -0,0 +1,33 @@ + +#ifdef CONFIG_SCHED_TUNE + +#include <linux/reciprocal_div.h> + +/* + * System energy normalization constants + */ +struct target_nrg { + unsigned long min_power; + unsigned long max_power; + struct reciprocal_value rdiv; +}; + +int schedtune_cpu_boost(int cpu); +int schedtune_task_boost(struct task_struct *tsk); + +int schedtune_prefer_idle(struct task_struct *tsk); + +void schedtune_enqueue_task(struct task_struct *p, int cpu); +void schedtune_dequeue_task(struct task_struct *p, int cpu); + +#else /* CONFIG_SCHED_TUNE */ + +#define schedtune_cpu_boost(cpu) 0 +#define schedtune_task_boost(tsk) 0 + +#define schedtune_prefer_idle(tsk) 0 + +#define schedtune_enqueue_task(task, cpu) do { } while (0) +#define schedtune_dequeue_task(task, cpu) do { } while (0) + +#endif /* CONFIG_SCHED_TUNE */
diff --git a/kernel/sched/walt.c b/kernel/sched/walt.c new file mode 100644 index 0000000..01a411b --- /dev/null +++ b/kernel/sched/walt.c
@@ -0,0 +1,921 @@ +/* + * Copyright (c) 2016, The Linux Foundation. All rights reserved. + * + * This program is free software; you can redistribute it and/or modify + * it under the terms of the GNU General Public License version 2 and + * only version 2 as published by the Free Software Foundation. + * + * This program is distributed in the hope that it will be useful, + * but WITHOUT ANY WARRANTY; without even the implied warranty of + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the + * GNU General Public License for more details. + * + * + * Window Assisted Load Tracking (WALT) implementation credits: + * Srivatsa Vaddagiri, Steve Muckle, Syed Rameez Mustafa, Joonwoo Park, + * Pavan Kumar Kondeti, Olav Haugan + * + * 2016-03-06: Integration with EAS/refactoring by Vikram Mulukutla + * and Todd Kjos + */ + +#include <linux/acpi.h> +#include <linux/syscore_ops.h> +#include <trace/events/sched.h> +#include "sched.h" +#include "walt.h" + +#define WINDOW_STATS_RECENT 0 +#define WINDOW_STATS_MAX 1 +#define WINDOW_STATS_MAX_RECENT_AVG 2 +#define WINDOW_STATS_AVG 3 +#define WINDOW_STATS_INVALID_POLICY 4 + +#define EXITING_TASK_MARKER 0xdeaddead + +static __read_mostly unsigned int walt_ravg_hist_size = 5; +static __read_mostly unsigned int walt_window_stats_policy = + WINDOW_STATS_MAX_RECENT_AVG; +static __read_mostly unsigned int walt_account_wait_time = 1; +static __read_mostly unsigned int walt_freq_account_wait_time = 0; +static __read_mostly unsigned int walt_io_is_busy = 0; + +unsigned int sysctl_sched_walt_init_task_load_pct = 15; + +/* true -> use PELT based load stats, false -> use window-based load stats */ +bool __read_mostly walt_disabled = false; + +/* + * Window size (in ns). Adjust for the tick size so that the window + * rollover occurs just before the tick boundary. + */ +__read_mostly unsigned int walt_ravg_window = + (20000000 / TICK_NSEC) * TICK_NSEC; +#define MIN_SCHED_RAVG_WINDOW ((10000000 / TICK_NSEC) * TICK_NSEC) +#define MAX_SCHED_RAVG_WINDOW ((1000000000 / TICK_NSEC) * TICK_NSEC) + +static unsigned int sync_cpu; +static ktime_t ktime_last; +static __read_mostly bool walt_ktime_suspended; + +static unsigned int task_load(struct task_struct *p) +{ + return p->ravg.demand; +} + +static inline void fixup_cum_window_demand(struct rq *rq, s64 delta) +{ + rq->cum_window_demand += delta; + if (unlikely((s64)rq->cum_window_demand < 0)) + rq->cum_window_demand = 0; +} + +void +walt_inc_cumulative_runnable_avg(struct rq *rq, + struct task_struct *p) +{ + rq->cumulative_runnable_avg += p->ravg.demand; + + /* + * Add a task's contribution to the cumulative window demand when + * + * (1) task is enqueued with on_rq = 1 i.e migration, + * prio/cgroup/class change. + * (2) task is waking for the first time in this window. + */ + if (p->on_rq || (p->last_sleep_ts < rq->window_start)) + fixup_cum_window_demand(rq, p->ravg.demand); +} + +void +walt_dec_cumulative_runnable_avg(struct rq *rq, + struct task_struct *p) +{ + rq->cumulative_runnable_avg -= p->ravg.demand; + BUG_ON((s64)rq->cumulative_runnable_avg < 0); + + /* + * on_rq will be 1 for sleeping tasks. So check if the task + * is migrating or dequeuing in RUNNING state to change the + * prio/cgroup/class. + */ + if (task_on_rq_migrating(p) || p->state == TASK_RUNNING) + fixup_cum_window_demand(rq, -(s64)p->ravg.demand); +} + +static void +fixup_cumulative_runnable_avg(struct rq *rq, + struct task_struct *p, u64 new_task_load) +{ + s64 task_load_delta = (s64)new_task_load - task_load(p); + + rq->cumulative_runnable_avg += task_load_delta; + if ((s64)rq->cumulative_runnable_avg < 0) + panic("cra less than zero: tld: %lld, task_load(p) = %u\n", + task_load_delta, task_load(p)); + + fixup_cum_window_demand(rq, task_load_delta); +} + +u64 walt_ktime_clock(void) +{ + if (unlikely(walt_ktime_suspended)) + return ktime_to_ns(ktime_last); + return ktime_get_ns(); +} + +static void walt_resume(void) +{ + walt_ktime_suspended = false; +} + +static int walt_suspend(void) +{ + ktime_last = ktime_get(); + walt_ktime_suspended = true; + return 0; +} + +static struct syscore_ops walt_syscore_ops = { + .resume = walt_resume, + .suspend = walt_suspend +}; + +static int __init walt_init_ops(void) +{ + register_syscore_ops(&walt_syscore_ops); + return 0; +} +late_initcall(walt_init_ops); + +void walt_inc_cfs_cumulative_runnable_avg(struct cfs_rq *cfs_rq, + struct task_struct *p) +{ + cfs_rq->cumulative_runnable_avg += p->ravg.demand; +} + +void walt_dec_cfs_cumulative_runnable_avg(struct cfs_rq *cfs_rq, + struct task_struct *p) +{ + cfs_rq->cumulative_runnable_avg -= p->ravg.demand; +} + +static int exiting_task(struct task_struct *p) +{ + if (p->flags & PF_EXITING) { + if (p->ravg.sum_history[0] != EXITING_TASK_MARKER) { + p->ravg.sum_history[0] = EXITING_TASK_MARKER; + } + return 1; + } + return 0; +} + +static int __init set_walt_ravg_window(char *str) +{ + unsigned int adj_window; + bool no_walt = walt_disabled; + + get_option(&str, &walt_ravg_window); + + /* Adjust for CONFIG_HZ */ + adj_window = (walt_ravg_window / TICK_NSEC) * TICK_NSEC; + + /* Warn if we're a bit too far away from the expected window size */ + WARN(adj_window < walt_ravg_window - NSEC_PER_MSEC, + "tick-adjusted window size %u, original was %u\n", adj_window, + walt_ravg_window); + + walt_ravg_window = adj_window; + + walt_disabled = walt_disabled || + (walt_ravg_window < MIN_SCHED_RAVG_WINDOW || + walt_ravg_window > MAX_SCHED_RAVG_WINDOW); + + WARN(!no_walt && walt_disabled, + "invalid window size, disabling WALT\n"); + + return 0; +} + +early_param("walt_ravg_window", set_walt_ravg_window); + +static void +update_window_start(struct rq *rq, u64 wallclock) +{ + s64 delta; + int nr_windows; + + delta = wallclock - rq->window_start; + /* If the MPM global timer is cleared, set delta as 0 to avoid kernel BUG happening */ + if (delta < 0) { + delta = 0; + WARN_ONCE(1, "WALT wallclock appears to have gone backwards or reset\n"); + } + + if (delta < walt_ravg_window) + return; + + nr_windows = div64_u64(delta, walt_ravg_window); + rq->window_start += (u64)nr_windows * (u64)walt_ravg_window; + + rq->cum_window_demand = rq->cumulative_runnable_avg; +} + +extern unsigned long capacity_curr_of(int cpu); +/* + * Translate absolute delta time accounted on a CPU + * to a scale where 1024 is the capacity of the most + * capable CPU running at FMAX + */ +static u64 scale_exec_time(u64 delta, struct rq *rq) +{ + unsigned long capcurr = capacity_curr_of(cpu_of(rq)); + + return (delta * capcurr) >> SCHED_CAPACITY_SHIFT; +} + +static int cpu_is_waiting_on_io(struct rq *rq) +{ + if (!walt_io_is_busy) + return 0; + + return atomic_read(&rq->nr_iowait); +} + +void walt_account_irqtime(int cpu, struct task_struct *curr, + u64 delta, u64 wallclock) +{ + struct rq *rq = cpu_rq(cpu); + unsigned long flags, nr_windows; + u64 cur_jiffies_ts; + + raw_spin_lock_irqsave(&rq->lock, flags); + + /* + * cputime (wallclock) uses sched_clock so use the same here for + * consistency. + */ + delta += sched_clock() - wallclock; + cur_jiffies_ts = get_jiffies_64(); + + if (is_idle_task(curr)) + walt_update_task_ravg(curr, rq, IRQ_UPDATE, walt_ktime_clock(), + delta); + + nr_windows = cur_jiffies_ts - rq->irqload_ts; + + if (nr_windows) { + if (nr_windows < 10) { + /* Decay CPU's irqload by 3/4 for each window. */ + rq->avg_irqload *= (3 * nr_windows); + rq->avg_irqload = div64_u64(rq->avg_irqload, + 4 * nr_windows); + } else { + rq->avg_irqload = 0; + } + rq->avg_irqload += rq->cur_irqload; + rq->cur_irqload = 0; + } + + rq->cur_irqload += delta; + rq->irqload_ts = cur_jiffies_ts; + raw_spin_unlock_irqrestore(&rq->lock, flags); +} + + +#define WALT_HIGH_IRQ_TIMEOUT 3 + +u64 walt_irqload(int cpu) { + struct rq *rq = cpu_rq(cpu); + s64 delta; + delta = get_jiffies_64() - rq->irqload_ts; + + /* + * Current context can be preempted by irq and rq->irqload_ts can be + * updated by irq context so that delta can be negative. + * But this is okay and we can safely return as this means there + * was recent irq occurrence. + */ + + if (delta < WALT_HIGH_IRQ_TIMEOUT) + return rq->avg_irqload; + else + return 0; +} + +int walt_cpu_high_irqload(int cpu) { + return walt_irqload(cpu) >= sysctl_sched_walt_cpu_high_irqload; +} + +static int account_busy_for_cpu_time(struct rq *rq, struct task_struct *p, + u64 irqtime, int event) +{ + if (is_idle_task(p)) { + /* TASK_WAKE && TASK_MIGRATE is not possible on idle task! */ + if (event == PICK_NEXT_TASK) + return 0; + + /* PUT_PREV_TASK, TASK_UPDATE && IRQ_UPDATE are left */ + return irqtime || cpu_is_waiting_on_io(rq); + } + + if (event == TASK_WAKE) + return 0; + + if (event == PUT_PREV_TASK || event == IRQ_UPDATE || + event == TASK_UPDATE) + return 1; + + /* Only TASK_MIGRATE && PICK_NEXT_TASK left */ + return walt_freq_account_wait_time; +} + +/* + * Account cpu activity in its busy time counters (rq->curr/prev_runnable_sum) + */ +static void update_cpu_busy_time(struct task_struct *p, struct rq *rq, + int event, u64 wallclock, u64 irqtime) +{ + int new_window, nr_full_windows = 0; + int p_is_curr_task = (p == rq->curr); + u64 mark_start = p->ravg.mark_start; + u64 window_start = rq->window_start; + u32 window_size = walt_ravg_window; + u64 delta; + + new_window = mark_start < window_start; + if (new_window) { + nr_full_windows = div64_u64((window_start - mark_start), + window_size); + if (p->ravg.active_windows < USHRT_MAX) + p->ravg.active_windows++; + } + + /* Handle per-task window rollover. We don't care about the idle + * task or exiting tasks. */ + if (new_window && !is_idle_task(p) && !exiting_task(p)) { + u32 curr_window = 0; + + if (!nr_full_windows) + curr_window = p->ravg.curr_window; + + p->ravg.prev_window = curr_window; + p->ravg.curr_window = 0; + } + + if (!account_busy_for_cpu_time(rq, p, irqtime, event)) { + /* account_busy_for_cpu_time() = 0, so no update to the + * task's current window needs to be made. This could be + * for example + * + * - a wakeup event on a task within the current + * window (!new_window below, no action required), + * - switching to a new task from idle (PICK_NEXT_TASK) + * in a new window where irqtime is 0 and we aren't + * waiting on IO */ + + if (!new_window) + return; + + /* A new window has started. The RQ demand must be rolled + * over if p is the current task. */ + if (p_is_curr_task) { + u64 prev_sum = 0; + + /* p is either idle task or an exiting task */ + if (!nr_full_windows) { + prev_sum = rq->curr_runnable_sum; + } + + rq->prev_runnable_sum = prev_sum; + rq->curr_runnable_sum = 0; + } + + return; + } + + if (!new_window) { + /* account_busy_for_cpu_time() = 1 so busy time needs + * to be accounted to the current window. No rollover + * since we didn't start a new window. An example of this is + * when a task starts execution and then sleeps within the + * same window. */ + + if (!irqtime || !is_idle_task(p) || cpu_is_waiting_on_io(rq)) + delta = wallclock - mark_start; + else + delta = irqtime; + delta = scale_exec_time(delta, rq); + rq->curr_runnable_sum += delta; + if (!is_idle_task(p) && !exiting_task(p)) + p->ravg.curr_window += delta; + + return; + } + + if (!p_is_curr_task) { + /* account_busy_for_cpu_time() = 1 so busy time needs + * to be accounted to the current window. A new window + * has also started, but p is not the current task, so the + * window is not rolled over - just split up and account + * as necessary into curr and prev. The window is only + * rolled over when a new window is processed for the current + * task. + * + * Irqtime can't be accounted by a task that isn't the + * currently running task. */ + + if (!nr_full_windows) { + /* A full window hasn't elapsed, account partial + * contribution to previous completed window. */ + delta = scale_exec_time(window_start - mark_start, rq); + if (!exiting_task(p)) + p->ravg.prev_window += delta; + } else { + /* Since at least one full window has elapsed, + * the contribution to the previous window is the + * full window (window_size). */ + delta = scale_exec_time(window_size, rq); + if (!exiting_task(p)) + p->ravg.prev_window = delta; + } + rq->prev_runnable_sum += delta; + + /* Account piece of busy time in the current window. */ + delta = scale_exec_time(wallclock - window_start, rq); + rq->curr_runnable_sum += delta; + if (!exiting_task(p)) + p->ravg.curr_window = delta; + + return; + } + + if (!irqtime || !is_idle_task(p) || cpu_is_waiting_on_io(rq)) { + /* account_busy_for_cpu_time() = 1 so busy time needs + * to be accounted to the current window. A new window + * has started and p is the current task so rollover is + * needed. If any of these three above conditions are true + * then this busy time can't be accounted as irqtime. + * + * Busy time for the idle task or exiting tasks need not + * be accounted. + * + * An example of this would be a task that starts execution + * and then sleeps once a new window has begun. */ + + if (!nr_full_windows) { + /* A full window hasn't elapsed, account partial + * contribution to previous completed window. */ + delta = scale_exec_time(window_start - mark_start, rq); + if (!is_idle_task(p) && !exiting_task(p)) + p->ravg.prev_window += delta; + + delta += rq->curr_runnable_sum; + } else { + /* Since at least one full window has elapsed, + * the contribution to the previous window is the + * full window (window_size). */ + delta = scale_exec_time(window_size, rq); + if (!is_idle_task(p) && !exiting_task(p)) + p->ravg.prev_window = delta; + + } + /* + * Rollover for normal runnable sum is done here by overwriting + * the values in prev_runnable_sum and curr_runnable_sum. + * Rollover for new task runnable sum has completed by previous + * if-else statement. + */ + rq->prev_runnable_sum = delta; + + /* Account piece of busy time in the current window. */ + delta = scale_exec_time(wallclock - window_start, rq); + rq->curr_runnable_sum = delta; + if (!is_idle_task(p) && !exiting_task(p)) + p->ravg.curr_window = delta; + + return; + } + + if (irqtime) { + /* account_busy_for_cpu_time() = 1 so busy time needs + * to be accounted to the current window. A new window + * has started and p is the current task so rollover is + * needed. The current task must be the idle task because + * irqtime is not accounted for any other task. + * + * Irqtime will be accounted each time we process IRQ activity + * after a period of idleness, so we know the IRQ busy time + * started at wallclock - irqtime. */ + + BUG_ON(!is_idle_task(p)); + mark_start = wallclock - irqtime; + + /* Roll window over. If IRQ busy time was just in the current + * window then that is all that need be accounted. */ + rq->prev_runnable_sum = rq->curr_runnable_sum; + if (mark_start > window_start) { + rq->curr_runnable_sum = scale_exec_time(irqtime, rq); + return; + } + + /* The IRQ busy time spanned multiple windows. Process the + * busy time preceding the current window start first. */ + delta = window_start - mark_start; + if (delta > window_size) + delta = window_size; + delta = scale_exec_time(delta, rq); + rq->prev_runnable_sum += delta; + + /* Process the remaining IRQ busy time in the current window. */ + delta = wallclock - window_start; + rq->curr_runnable_sum = scale_exec_time(delta, rq); + + return; + } + + BUG(); +} + +static int account_busy_for_task_demand(struct task_struct *p, int event) +{ + /* No need to bother updating task demand for exiting tasks + * or the idle task. */ + if (exiting_task(p) || is_idle_task(p)) + return 0; + + /* When a task is waking up it is completing a segment of non-busy + * time. Likewise, if wait time is not treated as busy time, then + * when a task begins to run or is migrated, it is not running and + * is completing a segment of non-busy time. */ + if (event == TASK_WAKE || (!walt_account_wait_time && + (event == PICK_NEXT_TASK || event == TASK_MIGRATE))) + return 0; + + return 1; +} + +/* + * Called when new window is starting for a task, to record cpu usage over + * recently concluded window(s). Normally 'samples' should be 1. It can be > 1 + * when, say, a real-time task runs without preemption for several windows at a + * stretch. + */ +static void update_history(struct rq *rq, struct task_struct *p, + u32 runtime, int samples, int event) +{ + u32 *hist = &p->ravg.sum_history[0]; + int ridx, widx; + u32 max = 0, avg, demand; + u64 sum = 0; + + /* Ignore windows where task had no activity */ + if (!runtime || is_idle_task(p) || exiting_task(p) || !samples) + goto done; + + /* Push new 'runtime' value onto stack */ + widx = walt_ravg_hist_size - 1; + ridx = widx - samples; + for (; ridx >= 0; --widx, --ridx) { + hist[widx] = hist[ridx]; + sum += hist[widx]; + if (hist[widx] > max) + max = hist[widx]; + } + + for (widx = 0; widx < samples && widx < walt_ravg_hist_size; widx++) { + hist[widx] = runtime; + sum += hist[widx]; + if (hist[widx] > max) + max = hist[widx]; + } + + p->ravg.sum = 0; + + if (walt_window_stats_policy == WINDOW_STATS_RECENT) { + demand = runtime; + } else if (walt_window_stats_policy == WINDOW_STATS_MAX) { + demand = max; + } else { + avg = div64_u64(sum, walt_ravg_hist_size); + if (walt_window_stats_policy == WINDOW_STATS_AVG) + demand = avg; + else + demand = max(avg, runtime); + } + + /* + * A throttled deadline sched class task gets dequeued without + * changing p->on_rq. Since the dequeue decrements hmp stats + * avoid decrementing it here again. + * + * When window is rolled over, the cumulative window demand + * is reset to the cumulative runnable average (contribution from + * the tasks on the runqueue). If the current task is dequeued + * already, it's demand is not included in the cumulative runnable + * average. So add the task demand separately to cumulative window + * demand. + */ + if (!task_has_dl_policy(p) || !p->dl.dl_throttled) { + if (task_on_rq_queued(p)) + fixup_cumulative_runnable_avg(rq, p, demand); + else if (rq->curr == p) + fixup_cum_window_demand(rq, demand); + } + + p->ravg.demand = demand; + +done: + trace_walt_update_history(rq, p, runtime, samples, event); + return; +} + +static void add_to_task_demand(struct rq *rq, struct task_struct *p, + u64 delta) +{ + delta = scale_exec_time(delta, rq); + p->ravg.sum += delta; + if (unlikely(p->ravg.sum > walt_ravg_window)) + p->ravg.sum = walt_ravg_window; +} + +/* + * Account cpu demand of task and/or update task's cpu demand history + * + * ms = p->ravg.mark_start; + * wc = wallclock + * ws = rq->window_start + * + * Three possibilities: + * + * a) Task event is contained within one window. + * window_start < mark_start < wallclock + * + * ws ms wc + * | | | + * V V V + * |---------------| + * + * In this case, p->ravg.sum is updated *iff* event is appropriate + * (ex: event == PUT_PREV_TASK) + * + * b) Task event spans two windows. + * mark_start < window_start < wallclock + * + * ms ws wc + * | | | + * V V V + * -----|------------------- + * + * In this case, p->ravg.sum is updated with (ws - ms) *iff* event + * is appropriate, then a new window sample is recorded followed + * by p->ravg.sum being set to (wc - ws) *iff* event is appropriate. + * + * c) Task event spans more than two windows. + * + * ms ws_tmp ws wc + * | | | | + * V V V V + * ---|-------|-------|-------|-------|------ + * | | + * |<------ nr_full_windows ------>| + * + * In this case, p->ravg.sum is updated with (ws_tmp - ms) first *iff* + * event is appropriate, window sample of p->ravg.sum is recorded, + * 'nr_full_window' samples of window_size is also recorded *iff* + * event is appropriate and finally p->ravg.sum is set to (wc - ws) + * *iff* event is appropriate. + * + * IMPORTANT : Leave p->ravg.mark_start unchanged, as update_cpu_busy_time() + * depends on it! + */ +static void update_task_demand(struct task_struct *p, struct rq *rq, + int event, u64 wallclock) +{ + u64 mark_start = p->ravg.mark_start; + u64 delta, window_start = rq->window_start; + int new_window, nr_full_windows; + u32 window_size = walt_ravg_window; + + new_window = mark_start < window_start; + if (!account_busy_for_task_demand(p, event)) { + if (new_window) + /* If the time accounted isn't being accounted as + * busy time, and a new window started, only the + * previous window need be closed out with the + * pre-existing demand. Multiple windows may have + * elapsed, but since empty windows are dropped, + * it is not necessary to account those. */ + update_history(rq, p, p->ravg.sum, 1, event); + return; + } + + if (!new_window) { + /* The simple case - busy time contained within the existing + * window. */ + add_to_task_demand(rq, p, wallclock - mark_start); + return; + } + + /* Busy time spans at least two windows. Temporarily rewind + * window_start to first window boundary after mark_start. */ + delta = window_start - mark_start; + nr_full_windows = div64_u64(delta, window_size); + window_start -= (u64)nr_full_windows * (u64)window_size; + + /* Process (window_start - mark_start) first */ + add_to_task_demand(rq, p, window_start - mark_start); + + /* Push new sample(s) into task's demand history */ + update_history(rq, p, p->ravg.sum, 1, event); + if (nr_full_windows) + update_history(rq, p, scale_exec_time(window_size, rq), + nr_full_windows, event); + + /* Roll window_start back to current to process any remainder + * in current window. */ + window_start += (u64)nr_full_windows * (u64)window_size; + + /* Process (wallclock - window_start) next */ + mark_start = window_start; + add_to_task_demand(rq, p, wallclock - mark_start); +} + +/* Reflect task activity on its demand and cpu's busy time statistics */ +void walt_update_task_ravg(struct task_struct *p, struct rq *rq, + int event, u64 wallclock, u64 irqtime) +{ + if (walt_disabled || !rq->window_start) + return; + + /* there's a bug here - there are many cases where + * we enter here without holding this lock, coming from + * walt_fixup_busy_time - looks like in 4.14 we don't + * hold the dest_rq at time of migration, but I haven't + * yet worked out if it is safe to always lock dest_rq there. + * + * temporarily disable this assert to continue checking the + * rest of the locking here. + */ + //lockdep_assert_held(&rq->lock); + + update_window_start(rq, wallclock); + + if (!p->ravg.mark_start) + goto done; + + update_task_demand(p, rq, event, wallclock); + update_cpu_busy_time(p, rq, event, wallclock, irqtime); + +done: + trace_walt_update_task_ravg(p, rq, event, wallclock, irqtime); + + p->ravg.mark_start = wallclock; +} + +static void reset_task_stats(struct task_struct *p) +{ + u32 sum = 0; + + if (exiting_task(p)) + sum = EXITING_TASK_MARKER; + + memset(&p->ravg, 0, sizeof(struct ravg)); + /* Retain EXITING_TASK marker */ + p->ravg.sum_history[0] = sum; +} + +void walt_mark_task_starting(struct task_struct *p) +{ + u64 wallclock; + struct rq *rq = task_rq(p); + + if (!rq->window_start) { + reset_task_stats(p); + return; + } + + wallclock = walt_ktime_clock(); + p->ravg.mark_start = wallclock; +} + +void walt_set_window_start(struct rq *rq, struct rq_flags *rf) +{ + if (likely(rq->window_start)) + return; + + if (cpu_of(rq) == sync_cpu) { + rq->window_start = 1; + } else { + struct rq *sync_rq = cpu_rq(sync_cpu); + rq_unpin_lock(rq, rf); + double_lock_balance(rq, sync_rq); + rq->window_start = sync_rq->window_start; + rq->curr_runnable_sum = rq->prev_runnable_sum = 0; + raw_spin_unlock(&sync_rq->lock); + rq_repin_lock(rq, rf); + } + + rq->curr->ravg.mark_start = rq->window_start; +} + +void walt_migrate_sync_cpu(int cpu) +{ + if (cpu == sync_cpu) + sync_cpu = smp_processor_id(); +} + +void walt_fixup_busy_time(struct task_struct *p, int new_cpu) +{ + struct rq *src_rq = task_rq(p); + struct rq *dest_rq = cpu_rq(new_cpu); + u64 wallclock; + + if (!p->on_rq && p->state != TASK_WAKING) + return; + + if (exiting_task(p)) { + return; + } + + if (p->state == TASK_WAKING) + double_rq_lock(src_rq, dest_rq); + + wallclock = walt_ktime_clock(); + +//#define LOCK_CONDITION(rq) (debug_locks && !lockdep_is_held(&rq->lock)) +// WARN(LOCK_CONDITION(task_rq(p)), "task_rq(p) not held. p->state=%08lx new_cpu=%d task_cpu=%d", p->state, new_cpu, p->cpu); +// WARN(LOCK_CONDITION(dest_rq), "dest_rq not held. p->state=%08lx new_cpu=%d task_cpu=%d", p->state, new_cpu, p->cpu); + + /* + * It seems that in lots of cases we don't have + * dest_rq locked when we get here, which means + * we can't be sure to the WALT stats - someone + * needs to fix this. + */ + walt_update_task_ravg(task_rq(p)->curr, task_rq(p), + TASK_UPDATE, wallclock, 0); + walt_update_task_ravg(dest_rq->curr, dest_rq, + TASK_UPDATE, wallclock, 0); + +// WARN(LOCK_CONDITION(task_rq(p)), "task_rq(p) not held after rq update. p->state=%08lx new_cpu=%d task_cpu=%d", p->state, new_cpu, p->cpu); + walt_update_task_ravg(p, task_rq(p), TASK_MIGRATE, wallclock, 0); + + /* + * When a task is migrating during the wakeup, adjust + * the task's contribution towards cumulative window + * demand. + */ + if (p->state == TASK_WAKING && + p->last_sleep_ts >= src_rq->window_start) { + fixup_cum_window_demand(src_rq, -(s64)p->ravg.demand); + fixup_cum_window_demand(dest_rq, p->ravg.demand); + } + + if (p->ravg.curr_window) { + src_rq->curr_runnable_sum -= p->ravg.curr_window; + dest_rq->curr_runnable_sum += p->ravg.curr_window; + } + + if (p->ravg.prev_window) { + src_rq->prev_runnable_sum -= p->ravg.prev_window; + dest_rq->prev_runnable_sum += p->ravg.prev_window; + } + + if ((s64)src_rq->prev_runnable_sum < 0) { + src_rq->prev_runnable_sum = 0; + WARN_ON(1); + } + if ((s64)src_rq->curr_runnable_sum < 0) { + src_rq->curr_runnable_sum = 0; + WARN_ON(1); + } + + trace_walt_migration_update_sum(src_rq, p); + trace_walt_migration_update_sum(dest_rq, p); + + if (p->state == TASK_WAKING) + double_rq_unlock(src_rq, dest_rq); +} + +void walt_init_new_task_load(struct task_struct *p) +{ + int i; + u32 init_load_windows = + div64_u64((u64)sysctl_sched_walt_init_task_load_pct * + (u64)walt_ravg_window, 100); + u32 init_load_pct = current->init_load_pct; + + p->init_load_pct = 0; + memset(&p->ravg, 0, sizeof(struct ravg)); + + if (init_load_pct) { + init_load_windows = div64_u64((u64)init_load_pct * + (u64)walt_ravg_window, 100); + } + + p->ravg.demand = init_load_windows; + for (i = 0; i < RAVG_HIST_SIZE_MAX; ++i) + p->ravg.sum_history[i] = init_load_windows; +}
diff --git a/kernel/sched/walt.h b/kernel/sched/walt.h new file mode 100644 index 0000000..c7a4ef9 --- /dev/null +++ b/kernel/sched/walt.h
@@ -0,0 +1,62 @@ +/* + * Copyright (c) 2016, The Linux Foundation. All rights reserved. + * + * This program is free software; you can redistribute it and/or modify + * it under the terms of the GNU General Public License version 2 and + * only version 2 as published by the Free Software Foundation. + * + * This program is distributed in the hope that it will be useful, + * but WITHOUT ANY WARRANTY; without even the implied warranty of + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the + * GNU General Public License for more details. + */ + +#ifndef __WALT_H +#define __WALT_H + +#ifdef CONFIG_SCHED_WALT + +void walt_update_task_ravg(struct task_struct *p, struct rq *rq, int event, + u64 wallclock, u64 irqtime); +void walt_inc_cumulative_runnable_avg(struct rq *rq, struct task_struct *p); +void walt_dec_cumulative_runnable_avg(struct rq *rq, struct task_struct *p); +void walt_inc_cfs_cumulative_runnable_avg(struct cfs_rq *rq, + struct task_struct *p); +void walt_dec_cfs_cumulative_runnable_avg(struct cfs_rq *rq, + struct task_struct *p); +void walt_fixup_busy_time(struct task_struct *p, int new_cpu); +void walt_init_new_task_load(struct task_struct *p); +void walt_mark_task_starting(struct task_struct *p); +void walt_set_window_start(struct rq *rq, struct rq_flags *rf); +void walt_migrate_sync_cpu(int cpu); +u64 walt_ktime_clock(void); +void walt_account_irqtime(int cpu, struct task_struct *curr, u64 delta, + u64 wallclock); + +u64 walt_irqload(int cpu); +int walt_cpu_high_irqload(int cpu); + +#else /* CONFIG_SCHED_WALT */ + +static inline void walt_update_task_ravg(struct task_struct *p, struct rq *rq, + int event, u64 wallclock, u64 irqtime) { } +static inline void walt_inc_cumulative_runnable_avg(struct rq *rq, struct task_struct *p) { } +static inline void walt_dec_cumulative_runnable_avg(struct rq *rq, struct task_struct *p) { } +static inline void walt_inc_cfs_cumulative_runnable_avg(struct cfs_rq *rq, + struct task_struct *p) { } +static inline void walt_dec_cfs_cumulative_runnable_avg(struct cfs_rq *rq, + struct task_struct *p) { } +static inline void walt_fixup_busy_time(struct task_struct *p, int new_cpu) { } +static inline void walt_init_new_task_load(struct task_struct *p) { } +static inline void walt_mark_task_starting(struct task_struct *p) { } +static inline void walt_set_window_start(struct rq *rq, struct rq_flags *rf) { } +static inline void walt_migrate_sync_cpu(int cpu) { } +static inline u64 walt_ktime_clock(void) { return 0; } + +#define walt_cpu_high_irqload(cpu) false + +#endif /* CONFIG_SCHED_WALT */ + +extern bool walt_disabled; + +#endif
diff --git a/kernel/sysctl.c b/kernel/sysctl.c index d9c31bc..bc15c2a 100644 --- a/kernel/sysctl.c +++ b/kernel/sysctl.c
@@ -329,6 +329,50 @@ static struct ctl_table kern_table[] = { .extra1 = &min_sched_granularity_ns, .extra2 = &max_sched_granularity_ns, }, +#ifdef CONFIG_SCHED_WALT + { + .procname = "sched_use_walt_cpu_util", + .data = &sysctl_sched_use_walt_cpu_util, + .maxlen = sizeof(unsigned int), + .mode = 0644, + .proc_handler = proc_dointvec, + }, + { + .procname = "sched_use_walt_task_util", + .data = &sysctl_sched_use_walt_task_util, + .maxlen = sizeof(unsigned int), + .mode = 0644, + .proc_handler = proc_dointvec, + }, + { + .procname = "sched_walt_init_task_load_pct", + .data = &sysctl_sched_walt_init_task_load_pct, + .maxlen = sizeof(unsigned int), + .mode = 0644, + .proc_handler = proc_dointvec, + }, + { + .procname = "sched_walt_cpu_high_irqload", + .data = &sysctl_sched_walt_cpu_high_irqload, + .maxlen = sizeof(unsigned int), + .mode = 0644, + .proc_handler = proc_dointvec, + }, +#endif + { + .procname = "sched_sync_hint_enable", + .data = &sysctl_sched_sync_hint_enable, + .maxlen = sizeof(unsigned int), + .mode = 0644, + .proc_handler = proc_dointvec, + }, + { + .procname = "sched_cstate_aware", + .data = &sysctl_sched_cstate_aware, + .maxlen = sizeof(unsigned int), + .mode = 0644, + .proc_handler = proc_dointvec, + }, { .procname = "sched_wakeup_granularity_ns", .data = &sysctl_sched_wakeup_granularity,