| <html devsite> |
| <head> |
| <title>Performance Testing</title> |
| <meta name="project_path" value="/_project.yaml" /> |
| <meta name="book_path" value="/_book.yaml" /> |
| </head> |
| <body> |
| <!-- |
| Copyright 2017 The Android Open Source Project |
| |
| Licensed under the Apache License, Version 2.0 (the "License"); |
| you may not use this file except in compliance with the License. |
| You may obtain a copy of the License at |
| |
| http://www.apache.org/licenses/LICENSE-2.0 |
| |
| Unless required by applicable law or agreed to in writing, software |
| distributed under the License is distributed on an "AS IS" BASIS, |
| WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. |
| See the License for the specific language governing permissions and |
| limitations under the License. |
| --> |
| |
| |
| <p>Android 8.0 includes binder and hwbinder performance tests for throughput and |
| latency. While many scenarios exist for detecting perceptible performance |
| problems, running such scenarios can be time consuming and results are often |
| unavailable until after a system is integrated. Using the provided performance |
| tests makes it easier to test during development, detect serious problems |
| earlier, and improve user experience.</p> |
| |
| <p>Performance tests include the following four categories:</p> |
| <ul> |
| <li>binder throughput (available in |
| <code>system/libhwbinder/vts/performance/Benchmark_binder.cpp</code>)</li> |
| <li>binder latency (available in |
| <code>frameworks/native/libs/binder/tests/schd-dbg.cpp</code>)</li> |
| <li>hwbinder throughput (available in |
| <code>system/libhwbinder/vts/performance/Benchmark.cpp</code>)</li> |
| <li>hwbinder latency (available in |
| <code>system/libhwbinder/vts/performance/Latency.cpp</code>)</li> |
| </ul> |
| |
| <h2 id=about>About binder and hwbinder</h2> |
| <p>Binder and hwbinder are Android inter-process communication (IPC) |
| infrastructures that share the same Linux driver but have the following |
| qualitative differences:</p> |
| |
| <table> |
| <tr> |
| <th>Aspect</th> |
| <th>binder</th> |
| <th>hwbinder</th> |
| </tr> |
| |
| <tr> |
| <td>Purpose</td> |
    <td>Provide a general-purpose IPC scheme for the framework</td>
| <td>Communicate with hardware</td> |
| </tr> |
| |
| <tr> |
| <td>Property</td> |
| <td>Optimized for Android framework usage</td> |
    <td>Minimal overhead, low latency</td>
| </tr> |
| |
| <tr> |
| <td>Change scheduling policy for foreground/background</td> |
| <td>Yes</td> |
| <td>No</td> |
| </tr> |
| |
| <tr> |
| <td>Arguments passing</td> |
| <td>Uses serialization supported by Parcel object</td> |
    <td>Uses scatter buffers and avoids the overhead of copying the data that
    Parcel serialization requires</td>
| </tr> |
| |
| <tr> |
| <td>Priority inheritance</td> |
| <td>No</td> |
| <td>Yes</td> |
| </tr> |
| |
| </table> |
| |
<h3 id=transactions>Binder and hwbinder processes</h3>
| <p>A systrace visualizer displays transactions as follows:</p> |
| <img src="images/treble_systrace_binder_processes.png"> |
| <figcaption><strong>Figure 1.</strong> Systrace visualization of binder |
| processes.</figcaption> |
| |
| <p>In the above example:</p> |
| <ul> |
| <li>The four (4) schd-dbg processes are client processes.</li> |
<li>The four (4) binder processes are server processes (names start with
<strong>Binder</strong> and end with a sequence number).</li>
| <li>A client process is always paired with a server process, which is dedicated |
| to its client.</li> |
<li>All client/server process pairs are scheduled concurrently and
independently by the kernel.</li>
| </ul> |
| |
<p>On CPU 1, the OS kernel executes the client to issue the request. It then
uses the same CPU whenever possible to wake the server process, handle the
request, and context switch back after the request completes.</p>
| |
| <h3 id=throughput-diffs>Throughput vs. latency</h3> |
<p>In a perfect transaction, where the client and server process switch
seamlessly, the throughput and latency tests do not produce substantially
different results. However, when the OS kernel is handling an interrupt request
(IRQ) from hardware, waiting for locks, or simply choosing not to handle a
message immediately, a latency bubble can form.</p>
| |
| <img src="images/treble_latency_bubble.png"> |
| <figcaption><strong>Figure 2.</strong> Latency bubble due to differences in |
| throughput and latency.</figcaption> |
| |
| <p>The throughput test generates a large number of transactions with different |
| payload sizes, providing a good estimation for the regular transaction time (in |
| best case scenarios) and the maximum throughput the binder can achieve.</p> |
| |
<p>In contrast, the latency test performs no actions on the payload, minimizing
the regular transaction time. We can use transaction time to estimate the binder
overhead, gather statistics for the worst case, and calculate the ratio of
transactions whose latency meets a specified deadline.</p>
| |
| <h3 id=priority-inversions>Handling priority inversions</h3> |
| <p>A priority inversion occurs when a thread with higher priority is logically |
| waiting for a thread with lower priority. Real-time (RT) applications have a |
| priority inversion problem:</p> |
| |
| <img src="images/treble_priority_inv_rta.png"> |
| <figcaption><strong>Figure 3.</strong> Priority inversion in real-time |
| applications.</figcaption> |
| |
<p>When using the Linux Completely Fair Scheduler (CFS), a thread always has a
chance to run even when other threads have a higher priority. As a result,
applications with CFS scheduling treat priority inversion as expected behavior
rather than as a problem. However, in cases where the Android framework needs RT
scheduling to guarantee the privilege of high-priority threads, priority
inversion must be resolved.</p>
| |
<p>Example priority inversion during a binder transaction (an RT thread is
logically blocked by other CFS threads while waiting for a binder thread to
service it):</p>
| <img src="images/treble_priority_inv_rta_blocked.png"> |
| <figcaption><strong>Figure 4.</strong> Priority inversion, blocked real-time |
| threads.</figcaption> |
| |
<p>To avoid blockages, you can use priority inheritance to temporarily escalate
the binder thread to an RT thread while it services a request from an RT client.
Keep in mind that RT scheduling has limited resources and should be used
carefully. In a system with <em>n</em> CPUs, the maximum number of concurrent RT
threads is also <em>n</em>; additional RT threads might need to wait (and thus
miss their deadlines) if all CPUs are occupied by other RT threads.</p>
| |
| <p>To resolve all possible priority inversions, you could use priority |
| inheritance for both binder and hwbinder. However, as binder is widely used |
| across the system, enabling priority inheritance for binder transactions might |
| spam the system with more RT threads than it can service.</p> |
| |
| <h2 id=throughput>Running throughput tests</h2> |
<p>The throughput tests measure binder/hwbinder transaction throughput. In a
system that is not overloaded, latency bubbles are rare and their impact can be
eliminated as long as the number of iterations is high enough.</p>
| |
| <ul> |
| <li>The <strong>binder</strong> throughput test is in |
| <code>system/libhwbinder/vts/performance/Benchmark_binder.cpp</code>.</li> |
| <li>The <strong>hwbinder</strong> throughput test is in |
| <code>system/libhwbinder/vts/performance/Benchmark.cpp</code>.</li> |
| </ul> |
| |
| <h3 id=throughput-results>Test results</h3> |
| <p>Example throughput test results for transactions using different payload |
| sizes:</p> |
| |
| <pre class="prettyprint"> |
| Benchmark Time CPU Iterations |
| --------------------------------------------------------------------- |
| BM_sendVec_binderize/4 70302 ns 32820 ns 21054 |
| BM_sendVec_binderize/8 69974 ns 32700 ns 21296 |
| BM_sendVec_binderize/16 70079 ns 32750 ns 21365 |
| BM_sendVec_binderize/32 69907 ns 32686 ns 21310 |
| BM_sendVec_binderize/64 70338 ns 32810 ns 21398 |
| BM_sendVec_binderize/128 70012 ns 32768 ns 21377 |
| BM_sendVec_binderize/256 69836 ns 32740 ns 21329 |
| BM_sendVec_binderize/512 69986 ns 32830 ns 21296 |
| BM_sendVec_binderize/1024 69714 ns 32757 ns 21319 |
| BM_sendVec_binderize/2k 75002 ns 34520 ns 20305 |
| BM_sendVec_binderize/4k 81955 ns 39116 ns 17895 |
| BM_sendVec_binderize/8k 95316 ns 45710 ns 15350 |
| BM_sendVec_binderize/16k 112751 ns 54417 ns 12679 |
| BM_sendVec_binderize/32k 146642 ns 71339 ns 9901 |
| BM_sendVec_binderize/64k 214796 ns 104665 ns 6495 |
| </pre> |
| |
| <ul> |
| <li><strong>Time</strong> indicates the round trip delay measured in real time. |
| </li> |
| <li><strong>CPU</strong> indicates the accumulated time when CPUs are scheduled |
| for the test.</li> |
| <li><strong>Iterations</strong> indicates the number of times the test function |
| executed.</li> |
| </ul> |
| |
| <p>For example, for an 8-byte payload:</p> |
| |
| <pre class="prettyprint"> |
| BM_sendVec_binderize/8 69974 ns 32700 ns 21296 |
| </pre> |
<p>… the maximum throughput the binder can achieve is calculated as:</p>
<p><em>MAX throughput with 8-byte payload = (8 * 21296)/69974 ~= 2.435 bytes/ns
~= 2.268 GiB/s</em></p>
| |
| <h3 id=throughput-options>Test options</h3> |
<p>To get results in JSON format, run the test with the
<code>--benchmark_format=json</code> argument:</p>
| |
| <pre class="prettyprint"> |
| <code class="devsite-terminal">libhwbinder_benchmark --benchmark_format=json</code> |
| { |
| "context": { |
| "date": "2017-05-17 08:32:47", |
| "num_cpus": 4, |
| "mhz_per_cpu": 19, |
| "cpu_scaling_enabled": true, |
| "library_build_type": "release" |
| }, |
| "benchmarks": [ |
| { |
| "name": "BM_sendVec_binderize/4", |
| "iterations": 32342, |
| "real_time": 47809, |
| "cpu_time": 21906, |
| "time_unit": "ns" |
| }, |
| …. |
| } |
| </pre> |
| |
| <h2 id=latency>Running latency tests</h2> |
<p>The latency test measures the total time for the client to initialize the
transaction, switch to the server process for handling, and receive the result.
The test also looks for known bad scheduler behaviors that can negatively
impact transaction latency, such as a scheduler that does not support priority
inheritance or honor the sync flag.</p>
| |
| <ul> |
| <li>The binder latency test is in |
| <code>frameworks/native/libs/binder/tests/schd-dbg.cpp</code>.</li> |
| <li>The hwbinder latency test is in |
| <code>system/libhwbinder/vts/performance/Latency.cpp</code>.</li> |
| </ul> |
| |
| <h3 id=latency-results>Test results</h3> |
| <p>Results (in .json) show statistics for average/best/worst latency and the |
| number of deadlines missed.</p> |
| |
| <h3 id=latency-options>Test options</h3> |
| <p>Latency tests take the following options:</p> |
| |
| <table> |
| <tr> |
| <th>Command</th> |
| <th>Description</th> |
| </tr> |
| |
| <tr> |
| <td><code>-i <em>value</em></code></td> |
| <td>Specify number of iterations.</td> |
| </tr> |
| |
| <tr> |
| <td><code>-pair <em>value</em></code></td> |
| <td>Specify the number of process pairs.</td> |
| </tr> |
| |
| <tr> |
    <td><code>-deadline_us <em>value</em></code></td>
    <td>Specify the deadline in microseconds (μs).</td>
| </tr> |
| |
| <tr> |
| <td><code>-v</code></td> |
| <td>Get verbose (debugging) output.</td> |
| </tr> |
| |
| <tr> |
| <td><code>-trace</code></td> |
    <td>Stop tracing when a deadline miss is detected.</td>
| </tr> |
| |
| </table> |
| |
| <p>The following sections detail each option, describe usage, and provide |
| example results.</p> |
| |
| <h4 id=iterations>Specifying iterations</h4> |
| <p>Example with a large number of iterations and verbose output disabled:</p> |
| |
| <pre class="prettyprint"> |
| <code class="devsite-terminal">libhwbinder_latency -i 5000 -pair 3</code> |
| { |
| "cfg":{"pair":3,"iterations":5000,"deadline_us":2500}, |
| "P0":{"SYNC":"GOOD","S":9352,"I":10000,"R":0.9352, |
| "other_ms":{ "avg":0.2 , "wst":2.8 , "bst":0.053, "miss":2, "meetR":0.9996}, |
| "fifo_ms": { "avg":0.16, "wst":1.5 , "bst":0.067, "miss":0, "meetR":1} |
| }, |
| "P1":{"SYNC":"GOOD","S":9334,"I":10000,"R":0.9334, |
| "other_ms":{ "avg":0.19, "wst":2.9 , "bst":0.055, "miss":2, "meetR":0.9996}, |
| "fifo_ms": { "avg":0.16, "wst":3.1 , "bst":0.066, "miss":1, "meetR":0.9998} |
| }, |
| "P2":{"SYNC":"GOOD","S":9369,"I":10000,"R":0.9369, |
| "other_ms":{ "avg":0.19, "wst":4.8 , "bst":0.055, "miss":6, "meetR":0.9988}, |
| "fifo_ms": { "avg":0.15, "wst":1.8 , "bst":0.067, "miss":0, "meetR":1} |
| }, |
| "inheritance": "PASS" |
| } |
| </pre> |
| <p>These test results show the following:</p> |
| |
| <dl> |
  <dt><strong><code>"pair":3</code></strong></dt>
  <dd>Creates three client/server pairs (<code>P0</code>, <code>P1</code>, and
  <code>P2</code> in the results).</dd>
| |
| <dt><strong><code>"iterations": 5000</code></strong></dt> |
| <dd>Includes 5000 iterations.</dd> |
| |
  <dt><strong><code>"deadline_us":2500</code></strong></dt>
  <dd>Deadline is 2500 μs (2.5 ms); most transactions are expected to meet this
  value.</dd>
| |
| <dt><strong><code>"I": 10000</code></strong></dt> |
| <dd>A single test iteration includes two (2) transactions: |
| <ul> |
| <li>One transaction by normal priority (<code>CFS other</code>)</li> |
| <li>One transaction by real time priority (<code>RT-fifo</code>)</li> |
| </ul> |
| 5000 iterations equals a total of 10000 transactions.</dd> |
| |
  <dt><strong><code>"S": 9352</code></strong></dt>
  <dd>9352 of the 10000 transactions are synced on the same CPU.</dd>
| |
  <dt><strong><code>"R": 0.9352</code></strong></dt>
  <dd>The ratio at which the client and server are synced together on the same
  CPU (9352/10000 = 0.9352).</dd>
| |
| <dt><strong><code>"other_ms":{ "avg":0.2 , "wst":2.8 , "bst":0.053, "miss":2, |
| "meetR":0.9996}</code></strong></dt> |
  <dd>The average (<code>avg</code>), worst (<code>wst</code>), and best
  (<code>bst</code>) case latencies for all transactions issued by a
  normal-priority caller. Two transactions <code>miss</code> the deadline,
  making the meet ratio (<code>meetR</code>) 0.9996.</dd>
| |
| <dt><strong><code>"fifo_ms": { "avg":0.16, "wst":1.5 , "bst":0.067, "miss":0, |
| "meetR":1}</code></strong></dt> |
  <dd>Similar to <code>other_ms</code>, but for transactions issued by a client
  with <code>rt_fifo</code> priority. <code>fifo_ms</code> is likely (but not
  required) to show a better result than <code>other_ms</code>, with lower
  <code>avg</code> and <code>wst</code> values and a higher <code>meetR</code>
  (the difference can be even more significant with load in the background).</dd>
| |
| </dl> |
| |
<p class="note"><strong>Note:</strong> Background load may impact the throughput
result and the <code>other_ms</code> tuple in the latency test. Only the
<code>fifo_ms</code> results remain similar, as long as the background load has
a lower priority than <code>RT-fifo</code>.</p>
| |
| <h4 id=pair-values>Specifying pair values</h4> |
<p>Each client process is paired with a server process dedicated to that
client, and each pair may be scheduled independently to any CPU. However, CPU
migration should not happen during a transaction as long as the SYNC flag is
honored.</p>
| |
| <p>Ensure the system is not overloaded! While high latency in an overloaded |
| system is expected, test results for an overloaded system do not provide useful |
| information. To test a system with higher pressure, use <code>-pair |
| #cpu-1</code> (or <code>-pair #cpu</code> with caution). Testing using |
| <code>-pair <em>n</em></code> with <code><em>n</em> > #cpu</code> overloads the |
| system and generates useless information.</p> |
| |
| <h4 id=deadline-values>Specifying deadline values</h4> |
<p>After extensive user scenario testing (running the latency test on a
qualified product), we determined that 2.5 ms is the deadline to meet. For new
applications with higher requirements (such as 1000 photos/second), this
deadline value will change.</p>
| |
| <h4 id=verbose>Specifying verbose output</h4> |
| <p>Using the <code>-v</code> option displays verbose output. Example:</p> |
| |
| <pre class="devsite-click-to-copy"> |
| <code class="devsite-terminal">libhwbinder_latency -i 1 -v</code> |
| |
| <div style="color: orange">-------------------------------------------------- |
| service pid: 8674 tid: 8674 cpu: 1 |
| SCHED_OTHER 0</div> |
| -------------------------------------------------- |
| main pid: 8673 tid: 8673 cpu: 1 |
| |
| -------------------------------------------------- |
| client pid: 8677 tid: 8677 cpu: 0 |
| SCHED_OTHER 0 |
| |
| <div style="color: blue">-------------------------------------------------- |
| fifo-caller pid: 8677 tid: 8678 cpu: 0 |
| SCHED_FIFO 99 |
| |
| -------------------------------------------------- |
| hwbinder pid: 8674 tid: 8676 cpu: 0 |
| ??? 99</div> |
| <div style="color: green">-------------------------------------------------- |
| other-caller pid: 8677 tid: 8677 cpu: 0 |
| SCHED_OTHER 0 |
| |
| -------------------------------------------------- |
| hwbinder pid: 8674 tid: 8676 cpu: 0 |
| SCHED_OTHER 0</div> |
| </pre> |
| |
| <ul> |
<li>The <font style="color:orange">service thread</font> is created with
<code>SCHED_OTHER</code> priority and runs on <code>CPU:1</code> with <code>pid
8674</code>.</li>
<li>The <font style="color:blue">first transaction</font> is then started by a
<code>fifo-caller</code>. To service this transaction, the hwbinder upgrades the
priority of the server thread (<code>pid: 8674 tid: 8676</code>) to 99 and marks
it with a transient scheduling class (printed as <code>???</code>). The
scheduler then runs the server process on <code>CPU:0</code>, syncing it with
the same CPU as its client.</li>
| <li>The <font style="color:green">second transaction</font> caller has a |
| <code>SCHED_OTHER</code> priority. The server downgrades itself and services the |
| caller with <code>SCHED_OTHER</code> priority.</li> |
| </ul> |
| |
| <h4 id=trace>Using trace for debugging</h4> |
| <p>You can specify the <code>-trace</code> option to debug latency issues. When |
| used, the latency test stops the tracelog recording at the moment when bad |
| latency is detected. Example:</p> |
| |
| <pre class="prettyprint"> |
| <code class="devsite-terminal">atrace --async_start -b 8000 -c sched idle workq binder_driver sync freq</code> |
| <code class="devsite-terminal">libhwbinder_latency -deadline_us 50000 -trace -i 50000 -pair 3</code> |
deadline triggered: halt &amp; stop trace
| log:/sys/kernel/debug/tracing/trace |
| </pre> |
| |
| <p>The following components can impact latency:</p> |
| |
| <ul> |
| <li><strong>Android build mode</strong>. Eng mode is usually slower than |
| userdebug mode.</li> |
<li><strong>Framework</strong>. How does the framework service use
<code>ioctl</code> to configure the binder?</li>
<li><strong>Binder driver</strong>. Does the driver support fine-grained
locking? Does it contain all performance tuning patches?</li>
| <li><strong>Kernel version</strong>. The better real time capability the kernel |
| has, the better the results.</li> |
| <li><strong>Kernel config</strong>. Does the kernel config contain |
| <code>DEBUG</code> configs such as <code>DEBUG_PREEMPT</code> and |
| <code>DEBUG_SPIN_LOCK</code>?</li> |
| <li><strong>Kernel scheduler</strong>. Does the kernel have an Energy-Aware |
| scheduler (EAS) or Heterogeneous Multi-Processing (HMP) scheduler? Do any kernel |
| drivers (<code>cpu-freq</code> driver, <code>cpu-idle</code> driver, |
| <code>cpu-hotplug</code>, etc.) impact the scheduler?</li> |
| </ul> |
| |
| </body> |
| </html> |