<html devsite>
<head>
<title>Evaluating Performance</title>
<meta name="project_path" value="/_project.yaml" />
<meta name="book_path" value="/_book.yaml" />
</head>
<body>
<!--
Copyright 2017 The Android Open Source Project
Licensed under the Apache License, Version 2.0 (the "License");
you may not use this file except in compliance with the License.
You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License.
-->
<p>Use
<a href="https://android.googlesource.com/platform/system/extras/+/master/simpleperf/doc/README.md" class="external">Simpleperf</a>
to evaluate the performance of a device. Simpleperf is a native profiling tool for both
applications and native processes on Android. Use
<a href="https://developer.android.com/studio/profile/cpu-profiler" class="external">CPU Profiler</a>
to inspect app CPU usage and thread activity in real time.</p>
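<p>For example, a minimal Simpleperf session (on a rooted or userdebug device)
might record system-wide samples for a few seconds and then summarize where CPU
time was spent; the duration and output path below are arbitrary example
values:</p>
<pre class="devsite-click-to-copy">
<code class="devsite-terminal">adb shell simpleperf record -a --duration 10 -o /data/local/tmp/perf.data</code>
<code class="devsite-terminal">adb shell simpleperf report -i /data/local/tmp/perf.data</code>
</pre>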
<p>There are two user-visible indicators of performance:</p>
<ul>
<li><strong>Predictable, perceptible performance</strong>. Does the user
interface (UI) drop frames or consistently render at 60FPS? Does audio play
without artifacts or popping? How long is the delay between the user touching
the screen and the effect showing on the display?</li>
<li><strong>Length of time required for longer operations</strong> (such as
opening applications).</li>
</ul>
<p>The first is more noticeable than the second. Users typically notice jank,
but they can't tell the difference between a 500ms and a 600ms application
startup time unless they are looking at two devices side by side. Touch latency
is immediately noticeable and significantly contributes to the perception of a
device.</p>
<p>As a result, in a fast device, the UI pipeline is the most important thing in
the system other than what is necessary to keep the UI pipeline functional. This
means that the UI pipeline should preempt any other work that is not necessary
for fluid UI. To maintain a fluid UI, background syncing, notification delivery,
and similar work must all be delayed if UI work can be run. It is
acceptable to trade the performance of longer operations (HDR+ runtime,
application startup, etc.) to maintain a fluid UI.</p>
<h2 id="capacity_vs_jitter">Capacity vs jitter</h2>
<p>When considering device performance, <em>capacity</em> and <em>jitter</em>
are two meaningful metrics.</p>
<h3 id="capacity">Capacity</h3>
<p>Capacity is the total amount of some resource that the device possesses over
some amount of time. This can be CPU resources, GPU resources, I/O resources,
network resources, memory bandwidth, or any similar metric. When examining
whole-system performance, it can be useful to abstract the individual components
and assume a single metric that determines performance (especially when tuning a
new device because the workloads run on that device are likely fixed).</p>
<p>The capacity of a system varies based on the computing resources online.
Changing CPU/GPU frequency is the primary means of changing capacity, but there
are others such as changing the number of CPU cores online. Accordingly, the
capacity of a system corresponds with power consumption; <strong>changing
capacity always results in a similar change in power consumption.</strong></p>
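<p>As a rough illustration, CPU capacity can be inspected and constrained
through the standard Linux cpufreq and hotplug sysfs nodes (root required); the
exact paths, available frequencies, and core numbering vary by device, so treat
the values below as placeholders:</p>
<pre class="devsite-click-to-copy">
<code class="devsite-terminal">adb shell "cat /sys/devices/system/cpu/cpu0/cpufreq/scaling_available_frequencies"</code>
<code class="devsite-terminal">adb shell "echo 1036800 &gt; /sys/devices/system/cpu/cpu0/cpufreq/scaling_max_freq"</code>
<code class="devsite-terminal">adb shell "echo 0 &gt; /sys/devices/system/cpu/cpu4/online"</code>
</pre>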
<p>The capacity required at a given time is overwhelmingly determined by the
running application. As a result, the platform can do little to adjust the
capacity required for a given workload, and the means to do so are limited to
runtime improvements (Android framework, ART, Bionic, GPU compiler/drivers,
kernel).</p>
<h3 id="jitter">Jitter</h3>
<p>While the required capacity for a workload is easy to see, jitter is a more
nebulous concept. For a good introduction to jitter as an impediment to fast
systems, refer to
<em><a href="http://permalink.lanl.gov/object/tr?what=info:lanl-repo/lareport/LA-UR-03-3116">THE
CASE OF THE MISSING SUPERCOMPUTER PERFORMANCE: ACHIEVING OPTIMAL PERFORMANCE ON
THE 8,192 PROCESSORS OF ASCI Q</a></em>. (It's an investigation of why the ASCI
Q supercomputer did not achieve its expected performance and is a great
introduction to optimizing large systems.)</p>
<p>This page uses the term jitter to describe what the ASCI Q paper calls
<em>noise</em>. Jitter is the random system behavior that prevents perceptible
work from running. It is often work that must be run, but it may not have strict
timing requirements that cause it to run at any particular time. Because it is
random, it is extremely difficult to disprove the existence of jitter for a
given workload. It is also extremely difficult to prove that a known source of
jitter was the cause of a particular performance issue. The tools most commonly
used for diagnosing causes of jitter (such as tracing or logging) can introduce
their own jitter.</p>
<p>Sources of jitter experienced in real-world implementations of Android
include:</p>
<ul>
<li>Scheduler delay</li>
<li>Interrupt handlers</li>
<li>Driver code running for too long with preemption or interrupts disabled</li>
<li>Long-running softirqs</li>
<li>Lock contention (application, framework, kernel driver, binder lock, mmap
lock)</li>
<li>File descriptor contention where a low-priority thread holds the lock on a
file, preventing a high-priority thread from running</li>
<li>Running UI-critical code in workqueues where it could be delayed</li>
<li>CPU idle transitions</li>
<li>Logging</li>
<li>I/O delays</li>
<li>Unnecessary process creation (e.g., CONNECTIVITY_CHANGE broadcasts)</li>
<li>Page cache thrashing caused by insufficient free memory</li>
</ul>
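<p>Most of these sources are visible in a systrace capture that includes the
scheduler, IRQ, frequency, idle, and workqueue categories. A typical invocation
(the duration and category list are illustrative; available categories depend
on the platform build) looks like:</p>
<pre class="devsite-click-to-copy">
<code class="devsite-terminal">python systrace.py --time=10 -o jitter_trace.html sched irq freq idle workq gfx view</code>
</pre>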
<p>The required amount of time for a given period of jitter may or may not
decrease as capacity increases. For example, if a driver leaves interrupts
disabled while waiting for a read from across an i2c bus, it will take a fixed
amount of time regardless of whether the CPU is at 384MHz or 2GHz. Increasing
capacity is not a feasible solution to improve performance when jitter is
involved. As a result, <strong>faster processors will not usually improve
performance in jitter-constrained situations.</strong></p>
<p>Finally, unlike capacity, jitter is almost entirely within the domain of the
system vendor.</p>
<h3 id="memory_consumption">Memory consumption</h3>
<p>Memory consumption is traditionally blamed for poor performance. While
consumption itself is not a performance issue, it can cause jitter via
lowmemorykiller overhead, service restarts, and page cache thrashing. Reducing
memory consumption can avoid the direct causes of poor performance, but there
may be other targeted improvements that avoid those causes as well (for example,
pinning the framework to prevent it from being paged out when it will be paged
in soon after).</p>
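<p>To check whether memory pressure is a plausible contributor, start by looking
at overall memory state and per-process usage (output formats differ across
Android releases):</p>
<pre class="devsite-click-to-copy">
<code class="devsite-terminal">adb shell cat /proc/meminfo</code>
<code class="devsite-terminal">adb shell dumpsys meminfo</code>
</pre>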
<h2 id="analyze_initial">Analyzing initial device performance</h2>
<p>Starting from a functional but poorly-performing system and attempting to fix
the system's behavior by looking at individual cases of user-visible poor
performance is <strong>not</strong> a sound strategy. Because poor performance
is usually either not easily reproducible (i.e., jitter) or an application
issue, the full system has too many variables for this strategy to be effective.
As a result, it's very easy to misidentify causes and make minor improvements
while missing systemic opportunities for fixing performance across the
system.</p>
<p>Instead, use the following general approach when bringing up a new
device:</p>
<ol>
<li>Get the system booting to UI with all drivers running and some basic
frequency governor settings (if you change the frequency governor settings,
repeat all steps below).</li>
<li>Ensure the kernel supports the <code>sched_blocked_reason</code> tracepoint
as well as other tracepoints in the display pipeline that denote when the frame
is delivered to the display.</li>
<li>Take long traces of the entire UI pipeline (from receiving input via an IRQ
to final scanout) while running a lightweight and consistent workload (e.g.,
<a href="https://android.googlesource.com/platform/frameworks/base.git/+/master/tests/UiBench/">UiBench</a>
or the ball test in <a href="#touchlatency">TouchLatency</a>); example capture
commands follow this list.</li>
<li>Fix the frame drops detected in the lightweight and consistent
workload.</li>
<li>Repeat steps 3-4 until you can run with zero dropped frames for 20+ seconds
at a time. </li>
<li>Move on to other user-visible sources of jank.</li>
</ol>
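<p>As a sketch of steps 2 and 3, verify that the
<code>sched_blocked_reason</code> tracepoint exists on the device and then take
a long trace with a large buffer while the workload runs; the buffer size,
duration, and category list below are illustrative, not required values:</p>
<pre class="devsite-click-to-copy">
<code class="devsite-terminal">adb shell ls /d/tracing/events/sched/sched_blocked_reason</code>
<code class="devsite-terminal">python systrace.py --time=20 -b 32768 -o ui_pipeline_trace.html sched irq freq idle gfx view input</code>
</pre>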
<p>Other simple things you can do early on in device bringup include:</p>
<ul>
<li>Ensure your kernel has the
<a href="https://android.googlesource.com/kernel/msm/+/c9f00aa0e25e397533c198a0fcf6246715f99a7b%5E!/">sched_blocked_reason
tracepoint patch</a>. This tracepoint is enabled with the sched trace category
in systrace and provides the function responsible for sleeping when that
thread enters uninterruptible sleep. It is critical for performance analysis
because uninterruptible sleep is a very common indicator of jitter.</li>
<li>Ensure you have sufficient tracing for the GPU and display pipelines. On
recent Qualcomm SOCs, tracepoints are enabled using:</li>
<pre class="devsite-click-to-copy">
<code class="devsite-terminal">adb shell "echo 1 &gt; /d/tracing/events/kgsl/enable"</code>
<code class="devsite-terminal">adb shell "echo 1 &gt; /d/tracing/events/mdss/enable"</code>
</pre>
<p>These events remain enabled when you run systrace so you can see additional
information in the trace about the display pipeline (MDSS) in the
<code>mdss_fb0</code> section. On Qualcomm SOCs, you won't see any additional
information about the GPU in the standard systrace view, but the results are
present in the trace itself (for details, see
<a href="/devices/tech/debug/systrace.html">Understanding
systrace</a>).</p>
<p>What you want from this kind of display tracing is a single event that
directly indicates a frame has been delivered to the display. From there, you
can determine if you've hit your frame time successfully; if event X<em>n</em>
occurs less than 16.7ms after event X<em>n-1</em> (assuming a 60Hz display),
then you know you did not jank. If your SOC does not provide such signals, work
with your vendor to get them. Debugging jitter is extremely difficult without a
definitive signal of frame completion.</p></ul>
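<p>One way to sanity-check these signals, assuming the tracepoints above are
enabled, is to pull the raw ftrace buffer and inspect the timestamps of
successive display events (the exact event names depend on the display driver);
consecutive events should be roughly 16.7ms apart on a 60Hz panel when no
frames are dropped:</p>
<pre class="devsite-click-to-copy">
<code class="devsite-terminal">adb shell "cat /d/tracing/trace" &gt; raw_trace.txt</code>
<code class="devsite-terminal">grep mdp raw_trace.txt | tail -20</code>
</pre>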
<h3 id="synthetic_benchmarks">Using synthetic benchmarks</h3>
<p>Synthetic benchmarks are useful for ensuring a device's basic functionality
is present. However, treating benchmarks as a proxy for perceived device
performance is not useful.</p>
<p>Based on experiences with SOCs, differences in synthetic benchmark
performance between SOCs is not correlated with a similar difference in
perceptible UI performance (number of dropped frames, 99th percentile frame
time, etc.). Synthetic benchmarks are capacity-only benchmarks; jitter impacts
the measured performance of these benchmarks only by stealing time from the bulk
operation of the benchmark. As a result, synthetic benchmark scores are mostly
irrelevant as a metric of user-perceived performance.</p>
<p>Consider two SOCs running Benchmark X that renders 1000 frames of UI and
reports the total rendering time (lower score is better).</p>
<ul>
<li>SOC 1 renders each frame of Benchmark X in 10ms and scores 10,000.</li>
<li>SOC 2 renders 99% of frames in 1ms but 1% of frames in 100ms and scores
1,990, a dramatically better score.</li>
</ul>
<p>If the benchmark is indicative of actual UI performance, SOC 2 would be
unusable. Assuming a 60Hz refresh rate, SOC 2 would have a janky frame every
1.5s of operation. Meanwhile, SOC 1 (the slower SOC according to Benchmark X)
would be perfectly fluid.</p>
<h3 id="bug_reports">Using bug reports</h3>
<p>Bug reports are sometimes useful for performance analysis, but because they
are so heavyweight, they are rarely useful for debugging sporadic jank issues.
They may provide some hints on what the system was doing at a given time,
especially if the jank was around an application transition (which is logged in
a bug report). Bug reports can also indicate when something is more broadly
wrong with the system that could reduce its effective capacity (such as thermal
throttling or memory fragmentation).</p>
<h3 id="touchlatency">Using TouchLatency</h3>
<p>Several examples of bad behavior come from TouchLatency, which is the
preferred periodic workload used for the Pixel and Pixel XL. It's available at
<code>frameworks/base/tests/TouchLatency</code> and has two modes: touch latency
and bouncing ball (to switch modes, click the button in the upper-right
corner).</p>
<p>The bouncing ball test is exactly as simple as it appears: A ball bounces
around the screen forever, regardless of user input. It is usually also
<strong>by far</strong> the hardest test to run perfectly, but the closer it
comes to running without any dropped frames, the better your device will be. The
bouncing ball test is difficult because it is a trivial but perfectly consistent
workload that runs at a very low clock (this assumes the device has a frequency
governor; if the device is instead running with fixed clocks, downclock the
CPU/GPU to near-minimum when running the bouncing ball test for the first time).
As the system quiesces and the clocks drop closer to idle, the required CPU/GPU
time per frame increases. You can watch the ball and see things jank, and you'll
be able to see missed frames in systrace as well.</p>
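<p>If you need to pin clocks low manually (for example, on a device running with
fixed clocks), the standard cpufreq sysfs nodes are one way to do it; the node
paths and the frequency value below are examples that vary by device:</p>
<pre class="devsite-click-to-copy">
<code class="devsite-terminal">adb shell "cat /sys/devices/system/cpu/cpu0/cpufreq/cpuinfo_min_freq"</code>
<code class="devsite-terminal">adb shell "echo 384000 &gt; /sys/devices/system/cpu/cpu0/cpufreq/scaling_max_freq"</code>
</pre>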
<p>Because the workload is so consistent, you can identify most sources of
jitter much more easily than in most user-visible workloads by tracking exactly
what is running on the system during each missed frame instead of the UI
pipeline. <strong>The lower clocks amplify the effects of jitter by making it
more likely that any jitter causes a dropped frame.</strong> As a result, the
closer TouchLatency is to 60FPS, the less likely you are to have bad system
behaviors that cause sporadic, hard-to-reproduce jank in larger
applications.</p>
<p>As jitter is often (but not always) clockspeed-invariant, use a test that
runs at very low clocks to diagnose jitter for the following reasons:</p>
<ul>
<li>Not all jitter is clockspeed-invariant; many sources just consume CPU
time.</li>
<li>The governor should get the average frame time close to the deadline by
clocking down, so time spent running non-UI work can push it over the edge to
dropping a frame.</li>
</ul>
</body>
</html>