| <html devsite> |
| <head> |
| <title>Understanding Systrace</title> |
| <meta name="project_path" value="/_project.yaml" /> |
| <meta name="book_path" value="/_book.yaml" /> |
| </head> |
| <body> |
| <!-- |
| Copyright 2017 The Android Open Source Project |
| |
| Licensed under the Apache License, Version 2.0 (the "License"); |
| you may not use this file except in compliance with the License. |
| You may obtain a copy of the License at |
| |
| http://www.apache.org/licenses/LICENSE-2.0 |
| |
| Unless required by applicable law or agreed to in writing, software |
| distributed under the License is distributed on an "AS IS" BASIS, |
| WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. |
| See the License for the specific language governing permissions and |
| limitations under the License. |
| --> |
| |
| |
| <p class="caution"><strong>Caution:</strong> If you've never used systrace |
| before, we strongly recommend reading the |
| <a href="https://developer.android.com/studio/profile/systrace.html">systrace |
| overview</a> before continuing.</p> |
| |
| <p>systrace is the primary tool for analyzing Android device performance. |
| However, it's really a wrapper around other tools: It is the host-side wrapper |
| around <em>atrace</em>, the device-side executable that controls userspace |
| tracing and sets up <em>ftrace</em>, the primary tracing mechanism in the Linux |
| kernel. systrace uses atrace to enable tracing, then reads the ftrace buffer and |
| wraps it all in a self-contained HTML viewer. (While newer kernels have support |
| for Linux Enhanced Berkeley Packet Filter (eBPF), the following documentation |
pertains to the 3.18 kernel (no eBPF) as that's what was used on the Pixel/Pixel
| XL.)</p> |
| |
| <p>systrace is owned by the Google Android and Google Chrome teams and is |
| developed in the open as part of the |
| <a href="https://github.com/catapult-project/catapult">Catapult project</a>. In |
| addition to systrace, Catapult includes other useful utilities. For example, |
| ftrace has more features than can be directly enabled by systrace or atrace and |
| contains some advanced functionality that is critical to debugging performance |
| problems. (These features require root access and often a new kernel.)</p> |
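For instance, on a rooted device you can poke at the raw ftrace interface directly to see events that systrace doesn't expose. A minimal sketch (the debugfs mount point <code>/sys/kernel/debug</code> and the example event are assumptions that vary by device and kernel version):

```shell
# List the first few raw ftrace events the running kernel exposes (requires root).
adb root
adb shell "head -n 20 /sys/kernel/debug/tracing/available_events"

# Enable a raw event that atrace doesn't surface as its own category,
# e.g. per-allocation kmem events.
adb shell "echo 1 > /sys/kernel/debug/tracing/events/kmem/kmalloc/enable"
```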
| |
| <h2 id="running_systrace">Running systrace</h2> |
| <p>When debugging jitter on Pixel/Pixel XL, start with the following |
| command:</p> |
| |
| <pre class="devsite-terminal devsite-click-to-copy"> |
| ./systrace.py sched freq idle am wm gfx view sync binder_driver irq workq input -b 96000 |
| </pre> |
| |
| <p>When combined with the additional tracepoints required for GPU and display |
| pipeline activity, this gives you the ability to trace from user input to frame |
| displayed on screen. Set the buffer size to something large to avoid |
| losing events (which usually manifests as some CPUs containing no events after |
| some point in the trace).</p> |
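For example, a longer capture with an explicit duration and output file might look like the following (<code>-t</code> and <code>-o</code> are standard systrace options; adjust the category list and buffer size for your device):

```shell
# Trace for 10 seconds with a 96 MB ftrace buffer and write the
# self-contained viewer to jitter.html instead of the default trace.html.
./systrace.py -t 10 -b 96000 -o jitter.html \
    sched freq idle am wm gfx view sync binder_driver irq workq input
```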
| |
| <p>When going through systrace, keep in mind that <strong>every event is |
| triggered by something on the CPU</strong>.</p> |
| |
| <p class="note"><strong>Note:</strong> Hardware interrupts are not controlled by |
| the CPU and do trigger things in ftrace, but the actual commit to the trace log |
| is done by the interrupt handler, which could have been delayed if your |
| interrupt arrived while (for example) some other bad driver had interrupts |
| disabled. The critical element is the CPU.</p> |
| |
| <p>Because systrace is built on top of ftrace and ftrace runs on the CPU, |
| something on the CPU must write the ftrace buffer that logs hardware changes. |
| This means that if you're curious about why a display fence changed state, you |
| can see what was running on the CPU at the exact point of its transition |
| (something that is running on the CPU triggered that change in the log). This |
| concept is the foundation of analyzing performance using systrace.</p> |
| |
| <h2 id="example_1">Example: Working frame</h2> |
<p>This example describes a systrace for a normal UI pipeline. To follow along
with the example, <a href="perf_traces.zip">download the zip file</a> of traces
(which also includes other traces referred to in this section), unzip the file,
and open the systrace_tutorial.html file in your browser. Be warned: this
systrace is a large file. Unless you use systrace in your day-to-day work, it
likely contains far more information than any trace you've seen before.</p>
| <p>For a consistent, periodic workload such as TouchLatency, the UI pipeline |
| contains the following:</p> |
| <ol> |
| <li>EventThread in SurfaceFlinger wakes the app UI thread, signaling it's time |
| to render a new frame.</li> |
<li>App renders frame in the UI thread, RenderThread, and hwuiTasks, using CPU
and GPU resources. This is where the bulk of the capacity spent on UI goes.</li>
| <li>App sends rendered frame to SurfaceFlinger via binder and goes to |
| sleep.</li> |
| <li>A second EventThread in SurfaceFlinger wakes SurfaceFlinger to trigger |
| composition and display output. If SurfaceFlinger determines there is no work to |
| be done, it goes back to sleep.</li> |
| <li>SurfaceFlinger handles composition via HWC/HWC2 or GL. HWC/HWC2 composition |
| is faster and lower power but has limitations depending on the SOC. This usually |
| takes ~4-6ms, but can overlap with step 2 because Android applications are |
| always triple-buffered. (While applications are always triple-buffered, there |
| may only be one pending frame waiting in SurfaceFlinger, which makes it appear |
| identical to double-buffering.)</li> |
| <li>SurfaceFlinger dispatches final output to display via vendor driver and goes |
| back to sleep, waiting for EventThread wakeup.</li> |
| </ol> |
| |
| <p>Let's walk through the frame beginning at 15409ms:</p> |
| |
| <figure> |
| <a href="/devices/tech/debug/images/perf_trace_normal_pipeline.png" |
| title="Click to enlarge"> |
| <img src="/devices/tech/debug/images/perf_trace_normal_pipeline.png" |
| class="screenshot"> |
| </a> |
| <figcaption> |
| <strong>Figure 1.</strong> Normal UI pipeline, EventThread running. |
| </figcaption> |
| </figure> |
| |
| <p>Figure 1 is a normal frame surrounded by normal frames, so it's a good |
| starting point for understanding how the UI pipeline works. The UI thread row |
| for TouchLatency includes different colors at different times. Bars denote |
| different states for the thread:</p> |
| |
| <ul> |
| <li><strong>Gray</strong>. Sleeping.</li> |
| <li><strong>Blue</strong>. Runnable (it could run, but the scheduler hasn't |
| picked it to run yet).</li> |
| <li><strong>Green</strong>. Actively running (the scheduler thinks it's |
| running). |
| <p class="note"><strong>Note</strong>: Interrupt handlers aren't shown in the |
| normal per-CPU timeline, so it's possible that you're actually running interrupt |
| handlers or softirqs during some portion of a thread's runtime. Check the irq |
| section of the trace (under process 0) to confirm whether an interrupt is |
| running instead of a standard thread.</p> |
| </li> |
| <li><strong>Red</strong>. Uninterruptible sleep (generally sleeping on a lock |
| in the kernel). Can be indicative of I/O load. Extremely useful for debugging |
| performance issues.</li> |
| <li><strong>Orange</strong>. Uninterruptible sleep due to I/O load.</li> |
| </ul> |
| |
| <p>To view the reason for uninterruptible sleep (available from the |
| <code>sched_blocked_reason</code> tracepoint), select the red uninterruptible |
| sleep slice.</p> |
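If the red slices in your trace show no blocked reason, the tracepoint may not be enabled (or present) on your kernel. On a rooted device you can check for it and enable it by hand; the debugfs path is an assumption that varies by kernel:

```shell
# Check whether the kernel exposes the sched_blocked_reason tracepoint (requires root).
adb shell "ls /sys/kernel/debug/tracing/events/sched/ | grep blocked"

# If present, enable it so subsequent traces record blocked reasons.
adb shell "echo 1 > /sys/kernel/debug/tracing/events/sched/sched_blocked_reason/enable"
```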
| <p>While EventThread is running, the UI thread for TouchLatency becomes |
| runnable. To see what woke it, click the blue section:</p> |
| |
| <figure> |
| <a href="/devices/tech/debug/images/perf_trace_tl.png" |
| title="Click to enlarge"> |
| <img src="/devices/tech/debug/images/perf_trace_tl.png" class="screenshot"> |
| </a> |
| <figcaption> |
| <strong>Figure 2.</strong> UI thread for TouchLatency. |
| </figcaption> |
| </figure> |
| |
| <p>Figure 2 shows the TouchLatency UI thread was woken by tid 6843, which |
| corresponds to EventThread. The UI thread wakes:</p> |
| |
| <figure> |
| <a href="/devices/tech/debug/images/perf_trace_wake_render_enqueue.png" |
| title="Click to enlarge"> |
| <img src="/devices/tech/debug/images/perf_trace_wake_render_enqueue.png" |
| class="screenshot"> |
| </a> |
| <figcaption> |
| <strong>Figure 3.</strong> UI thread wakes, renders a frame, and enqueues it |
| for SurfaceFlinger to consume. |
| </figcaption> |
| </figure> |
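You can confirm this kind of tid-to-thread-name mapping on a live device as well. A sketch, assuming a recent Android build with <code>pidof</code>; tid 6843 is specific to this capture, so substitute the tid from your own trace:

```shell
# Resolve a tid inside the SurfaceFlinger process to its thread name.
SF_PID=$(adb shell pidof surfaceflinger)
adb shell "cat /proc/$SF_PID/task/6843/comm"   # 6843 is from this trace; use yours
```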
| |
| <p>If the <code>binder_driver</code> tag is enabled in a trace, you can select a |
| binder transaction to view information on all of the processes involved in that |
| transaction:</p> |
| |
| <figure> |
| <a href="/devices/tech/debug/images/perf_trace_binder_trans.png" |
| title="Click to enlarge"> |
| <img src="/devices/tech/debug/images/perf_trace_binder_trans.png" |
| class="screenshot"> |
| </a> |
| <figcaption> |
| <strong>Figure 4.</strong> Binder transaction. |
| </figcaption> |
| </figure> |
| |
| <p>Figure 4 shows that, at 15,423.65ms, Binder:6832_1 in SurfaceFlinger becomes |
| runnable because of tid 9579, which is TouchLatency's RenderThread. You can also |
| see queueBuffer on both sides of the binder transaction.</p> |
| |
| <p>During the queueBuffer on the SurfaceFlinger side, the number of pending |
| frames from TouchLatency goes from 1 to 2:</p> |
| |
| <figure> |
| <a href="/devices/tech/debug/images/perf_trace_pending_frames.png" |
| title="Click to enlarge"> |
| <img src="/devices/tech/debug/images/perf_trace_pending_frames.png" |
| class="screenshot"> |
| </a> |
| <figcaption> |
| <strong>Figure 5.</strong> Pending frames goes from 1 to 2. |
| </figcaption> |
| </figure> |
| |
| <p>Figure 5 shows triple-buffering, where there are two completed frames and the |
| app will soon start rendering a third. This is because we've already dropped |
| some frames, so the app keeps two pending frames instead of one to try to avoid |
| further dropped frames.</p> |
| |
| <p>Soon after, SurfaceFlinger's main thread is woken by a second EventThread so |
| it can output the older pending frame to the display:</p> |
| |
| <figure> |
| <a href="/devices/tech/debug/images/perf_trace_sf_woken_et.png" |
| title="Click to enlarge"> |
| <img src="/devices/tech/debug/images/perf_trace_sf_woken_et.png" |
| class="screenshot"> |
| </a> |
| <figcaption> |
| <strong>Figure 6.</strong> SurfaceFlinger's main thread is woken by a second |
| EventThread. |
| </figcaption> |
| </figure> |
| |
| <p>SurfaceFlinger first latches the older pending buffer, which causes the |
| pending buffer count to decrease from 2 to 1:</p> |
| |
| <figure> |
| <a href="/devices/tech/debug/images/perf_trace_sf_latches_pend.png" |
| title="Click to enlarge"> |
| <img src="/devices/tech/debug/images/perf_trace_sf_latches_pend.png" |
| class="screenshot"> |
| </a> |
| <figcaption> |
| <strong>Figure 7.</strong> SurfaceFlinger first latches the older pending |
| buffer. |
| </figcaption> |
| </figure> |
| |
| <p>After latching the buffer, SurfaceFlinger sets up composition and submits the |
| final frame to the display. (Some of these sections are enabled as part of the |
| <code>mdss</code> tracepoint, so they may not be there on your SOC.)</p> |
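To check whether your kernel exposes the <code>mdss</code> tracepoints at all (they belong to Qualcomm's MDSS display driver), you can list the ftrace event groups on a rooted device; the path is an assumption:

```shell
# mdss events exist only on kernels built with the Qualcomm MDSS display driver.
adb shell "ls /sys/kernel/debug/tracing/events/ | grep mdss"
```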
| |
| <figure> |
| <a href="/devices/tech/debug/images/perf_trace_sf_comp_submit.png" |
| title="Click to enlarge"> |
| <img src="/devices/tech/debug/images/perf_trace_sf_comp_submit.png" |
| class="screenshot"> |
| </a> |
| <figcaption> |
| <strong>Figure 8.</strong> SurfaceFlinger sets up composition and submits the |
| final frame. |
| </figcaption> |
| </figure> |
| |
| <p>Next, <code>mdss_fb0</code> wakes on CPU 0. <code>mdss_fb0</code> is the |
| display pipeline's kernel thread for outputting a rendered frame to the display. |
| We can see <code>mdss_fb0</code> as its own row in the trace (scroll down to |
| view).</p> |
| |
| <figure> |
| <a href="/devices/tech/debug/images/perf_trace_wake_cpu0.png" |
| title="Click to enlarge"> |
| <img src="/devices/tech/debug/images/perf_trace_wake_cpu0.png" |
| class="screenshot"> |
| </a> |
| <figcaption> |
| <strong>Figure 9.</strong> <code>mdss_fb0</code> wakes on CPU 0. |
| </figcaption> |
| </figure> |
| |
| <p><code>mdss_fb0</code> wakes up, runs for a bit, enters uninterruptible sleep, |
| then wakes again.</p> |
| |
| <p class="note"><strong>Note</strong>: From this point forward, the trace is |
| more complicated as the final work is split between <code>mdss_fb0</code>, |
| interrupts, and workqueue functions. If you need that level of detail, refer to |
| the exact characteristics of the driver stack for your SOC (as what happens on |
| the Pixel XL might not be useful to you).</p> |
| |
| <h2 id="example_2">Example: Non-working frame</h2> |
<p>This example describes a systrace used to debug Pixel/Pixel XL jitter. To
follow along with the example, <a href="/devices/tech/debug/perf_traces.zip">download the zip
file</a> of traces (which also includes other traces referred to in this
section), unzip the file, and open the systrace_tutorial.html file in your
browser.</p>
| |
| <p>When you first open the systrace, you'll see something like this:</p> |
| |
| <figure> |
| <a href="/devices/tech/debug/images/perf_trace_tl_pxl.png" |
| title="Click to enlarge"> |
| <img src="/devices/tech/debug/images/perf_trace_tl_pxl.png" |
| class="screenshot"> |
| </a> |
| <figcaption> |
| <strong>Figure 10</strong>. TouchLatency running on Pixel XL (most options |
| enabled, including mdss and kgsl tracepoints). |
| </figcaption> |
| </figure> |
| |
| <p>When looking for jank, check the FrameMissed row under SurfaceFlinger. |
| FrameMissed is a quality-of-life improvement provided by Hardware Composer 2 |
| (HWC2). As of December 2016, HWC2 is used only on Pixel/Pixel XL; when viewing |
systrace for other devices, the FrameMissed row may not be present. In either |
case, FrameMissed is correlated with SurfaceFlinger missing one of its |
extremely regular runtimes and an unchanged pending-buffer count for the app |
(<code>com.prefabulated.touchlatency</code>) at a vsync:</p>
| |
| <figure> |
| <a href="/devices/tech/debug/images/perf_trace_fm_sf.png" |
| title="Click to enlarge"> |
| <img src="/devices/tech/debug/images/perf_trace_fm_sf.png" |
| class="screenshot"> |
| </a> |
| <figcaption> |
| <strong>Figure 11</strong>. FrameMissed correlation with SurfaceFlinger. |
| </figcaption> |
| </figure> |
| |
| <p>Figure 11 shows a missed frame at 15598.29ms. SurfaceFlinger woke briefly at |
| the vsync interval and went back to sleep without doing any work, which means |
| SurfaceFlinger determined it was not worth trying to send a frame to the display |
| again. Why?</p> |
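As a cross-check outside of systrace, <code>dumpsys gfxinfo</code> reports aggregate frame-timing and jank statistics for an app; the exact output format varies by Android release:

```shell
# Summarize frame times and janky-frame counts for the TouchLatency app.
adb shell dumpsys gfxinfo com.prefabulated.touchlatency
```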
| |
| <p>To understand how the pipeline broke down for this frame, first review the |
| working frame example above to see how a normal UI pipeline appears in systrace. |
| When ready, return to the missed frame and work backwards. Notice that |
| SurfaceFlinger wakes and immediately goes to sleep. When viewing the number of |
| pending frames from TouchLatency, there are two frames (a good clue to help |
| figure out what's going on).</p> |
| |
| <figure> |
| <a href="/devices/tech/debug/images/perf_trace_sf_wake_sleep.png" |
| title="Click to enlarge"> |
| <img src="/devices/tech/debug/images/perf_trace_sf_wake_sleep.png" |
| class="screenshot"> |
| </a> |
| <figcaption> |
| <strong>Figure 12.</strong> SurfaceFlinger wakes and immediately goes to |
| sleep. |
| </figcaption> |
| </figure> |
| |
| <p>Because we have frames in SurfaceFlinger, it's not an app issue. In addition, |
| SurfaceFlinger is waking at the correct time, so it's not a SurfaceFlinger |
| issue. If SurfaceFlinger and the app are both looking normal, it's probably a |
| driver issue.</p> |
| |
| <p>Because the <code>mdss</code> and <code>sync</code> tracepoints are enabled, |
| we can get information about the fences (shared between the display driver and |
| SurfaceFlinger) that control when frames are actually submitted to the display. |
| The fences we care about are listed under <code>mdss_fb0_retire</code>, which |
| denotes when a frame is actually on the display. These fences are provided as |
| part of the <code>sync</code> trace category. Which fences correspond to |
| particular events in SurfaceFlinger depends on your SOC and driver stack, so |
| work with your SOC vendor to understand the meaning of fence categories in your |
| traces.</p> |
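To see which trace categories (including <code>sync</code> and, on supported devices, <code>mdss</code>) your device's atrace build supports, you can run:

```shell
# List every atrace category this device supports; 'sync' must be listed
# for the fence rows described above to appear in a trace.
adb shell atrace --list_categories
```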
| |
| <figure> |
| <a href="/devices/tech/debug/images/perf_trace_fences.png" |
| title="Click to enlarge"> |
| <img src="/devices/tech/debug/images/perf_trace_fences.png" |
| class="screenshot"> |
| </a> |
| <figcaption> |
| <strong>Figure 13.</strong> <code>mdss_fb0_retire</code> fences. |
| </figcaption> |
| </figure> |
| |
| <p>Figure 13 shows a frame that was displayed for 33ms, not 16.7ms as expected. |
| Halfway through that slice, that frame should have been replaced by a new one |
| but wasn't. View the previous frame and look for anything interesting:</p> |
| |
| <figure> |
| <a href="/devices/tech/debug/images/perf_trace_frame_previous.png" |
| title="Click to enlarge"> |
| <img src="/devices/tech/debug/images/perf_trace_frame_previous.png" |
| class="screenshot"> |
| </a> |
| <figcaption> |
| <strong>Figure 14.</strong> Frame previous to busted frame. |
| </figcaption> |
| </figure> |
| |
<p>Figure 14 shows a frame that took 14.482ms. The busted two-frame segment was
33.6ms, which is roughly what we would expect for two frames (we render at 60Hz,
16.7ms a frame, which is close). But 14.482ms is nowhere close to 16.7ms, which
suggests that something is very wrong with the display pipe.</p>
| |
| <p>Investigate exactly where that fence ends to determine what controls it:</p> |
| |
| <figure> |
| <a href="/devices/tech/debug/images/perf_trace_fence_end.png" |
| title="Click to enlarge"> |
| <img src="/devices/tech/debug/images/perf_trace_fence_end.png" |
| class="screenshot"> |
| </a> |
| <figcaption> |
| <strong>Figure 15.</strong> Investigate fence end. |
| </figcaption> |
| </figure> |
| |
| <p>A workqueue contains a <code>__vsync_retire_work_handler</code> that is |
| running when the fence changes. Looking through the kernel source, you can see |
| it's part of the display driver. It definitely appears to be on the critical |
| path for the display pipeline, so it must run as quickly as possible. It's |
| runnable for 70us or so (not a long scheduling delay), but it's a workqueue and |
| might not get scheduled accurately.</p> |
| |
| <p>Check the previous frame to determine if that contributed; sometimes jitter |
| can add up over time and eventually cause us to miss a deadline.</p> |
| |
| <figure> |
| <a href="/devices/tech/debug/images/perf_trace_previous_frame.png" |
| title="Click to enlarge"> |
| <img src="/devices/tech/debug/images/perf_trace_previous_frame.png" |
| class="screenshot"> |
| </a> |
| <figcaption> |
| <strong>Figure 16.</strong> Previous frame. |
| </figcaption> |
| </figure> |
| |
| <p>The runnable line on the kworker isn't visible because the viewer turns it |
| white when it's selected, but the statistics tell the story: 2.3ms of scheduler |
| delay for part of the display pipeline critical path is <strong>bad</strong>. |
| Before we do anything else, we should fix that by moving this part of the |
| display pipeline critical path from a workqueue (which runs as a |
| <code>SCHED_OTHER</code> CFS thread) to a dedicated <code>SCHED_FIFO</code> |
| kthread. This function needs timing guarantees that workqueues can't (and aren't |
| intended to) provide.</p> |
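Before committing to a driver change, you can approximate the effect from the shell on a rooted device by raising the thread's policy with <code>chrt</code> (available via toybox on recent builds, which is an assumption; the tid below is a placeholder). Note that boosting a shared kworker boosts every work item it happens to run, so this is a diagnostic experiment rather than a fix:

```shell
# Experiment: move a display-pipeline worker thread to SCHED_FIFO priority 2.
# 1234 is a placeholder tid; find the real kworker tid in your own trace.
adb shell "chrt -f -p 2 1234"
```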
| |
| <p>Is this the reason for the jank? It's hard to say conclusively. Outside of |
| easy-to-diagnose cases such as kernel lock contention causing display-critical |
| threads to sleep, traces usually won't tell you directly what the problem is. |
| Could this jitter have been the cause of the dropped frame? Absolutely. The |
| fence times should be 16.7ms, but they aren't close to that at all in the frames |
| leading up to the dropped frame. (There's a 19ms fence followed by a 14ms |
| fence.) Given how tightly coupled the display pipeline is, it's entirely |
| possible the jitter around fence timings resulted in an eventual dropped |
| frame.</p> |
| |
| <p>In this example, the solution involved converting |
| <code>__vsync_retire_work_handler</code> from a workqueue to a dedicated |
| kthread. This resulted in noticeable jitter improvements and reduced jank in the |
| bouncing ball test. Subsequent traces show fence timings that hover very close |
| to 16.7ms.</p> |
| |
| </body> |
| </html> |