Android Studio includes a graphical front end to Simpleperf, documented in Inspect CPU activity with CPU Profiler. Most users will prefer to use that instead of using Simpleperf directly.
Simpleperf is a native CPU profiling tool for Android. It can profile both Android applications and native processes running on Android, and it handles both Java and C++ code. The simpleperf executable can run on Android >= L, and the Python scripts can be used on Android >= N.
Simpleperf is part of the Android Open Source Project. The source code is here. The latest document is here.
An introduction slide deck is here.
Simpleperf contains two parts: the simpleperf executable and Python scripts.
The simpleperf executable works similarly to linux-tools-perf, but has some specific features for the Android profiling environment:
It collects more info in profiling data. Since the common workflow is “record on the device, and report on the host”, simpleperf not only collects samples in profiling data, but also collects needed symbols, device info and recording time.
It delivers new features for recording.
It relates closely to the Android platform.
It builds executables and shared libraries for different usages.
Detailed documentation for the simpleperf executable is here.
Python scripts are split into three parts according to their functions:
Scripts used for recording, like app_profiler.py, run_simpleperf_without_usb_connection.py.
Scripts used for reporting, like report.py, report_html.py, inferno.
Scripts used for parsing profiling data, like simpleperf_report_lib.py.
The Python scripts are tested on Python >= 3.9. Older versions may not be supported. Detailed documentation for the Python scripts is here.
The simpleperf executables and Python scripts are located in simpleperf/ in NDK releases, and in system/extras/simpleperf/scripts/ in AOSP. Their functions are listed below.
bin/: contains executables and shared libraries.
bin/android/${arch}/simpleperf: static simpleperf executables used on the device.
bin/${host}/${arch}/simpleperf: simpleperf executables used on the host, only supports reporting.
bin/${host}/${arch}/libsimpleperf_report.${so/dylib/dll}: report shared libraries used on the host.
*.py, inferno, purgatorio: Python scripts used for recording and reporting. Details are in scripts_reference.md.
See android_application_profiling.md.
See android_platform_profiling.md.
See executable_commands_reference.md.
See scripts_reference.md.
See view_the_profile.md.
On Android < N, the kernel may be too old (< 3.18) to support features like recording DWARF based call graphs. On Android M - O, we can only profile C++ code and fully compiled Java code. On Android >= P, the ART interpreter supports DWARF based unwinding, so we can profile Java code. On Android >= Q, we can use the simpleperf shipped on the device to profile released Android apps, with <profileable android:shell="true" />.
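The version rules above can be written down as a small lookup. This is only an illustrative sketch: the function name `profiling_notes` and the `API_LEVEL` table are hypothetical helpers, not part of simpleperf, and the table assumes the usual mapping from Android release letters to their first API level.

```python
# Hypothetical helper, not part of simpleperf.
# Assumption: release letters map to their first API level as below.
API_LEVEL = {"L": 21, "M": 23, "N": 24, "O": 26, "P": 28, "Q": 29}


def profiling_notes(api_level: int) -> list:
    """Return the statements from the text that apply to a device's API level."""
    notes = []
    if api_level < API_LEVEL["N"]:
        notes.append("kernel may be too old (< 3.18) for DWARF based call graphs")
    if API_LEVEL["M"] <= api_level < API_LEVEL["P"]:
        notes.append("can profile C++ code and fully compiled Java code only")
    if api_level >= API_LEVEL["P"]:
        notes.append("can profile Java code (ART interpreter supports DWARF based unwinding)")
    if api_level >= API_LEVEL["Q"]:
        notes.append("can profile released apps marked profileable by shell")
    return notes


# Example: print what applies to an Android Q device.
for note in profiling_notes(API_LEVEL["Q"]):
    print(note)
```

The function only restates the rules listed above. It does not probe a device, so the real kernel version still has to be checked separately.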
Simpleperf supports two ways of recording call stacks with samples. One is DWARF based call graph, the other is stack frame based call graph. Below is their comparison:
Recording DWARF based call graph:
Recording stack frame based call graph:
So if you need to profile code on ARM or profile Java code, DWARF based call graph is better. If you need to profile C++ code on ARM64, stack frame based call graphs may be better. In any case, you can first try DWARF based call graph, which is also the default option when -g is used, because it always produces reasonable results. If it doesn't work well enough, then try stack frame based call graph instead.
A DWARF-based call graph is generated by unwinding thread stacks. When a sample is recorded, the kernel dumps up to 64 kilobytes of stack data. By unwinding the stack based on DWARF information, we can get a call stack.
Two reasons may cause a broken call stack:
The kernel can only dump up to 64 kilobytes of stack data for each sample, but a thread can have a much larger stack. In this case, we can't unwind to the thread start point.
We need binaries containing DWARF call frame information to unwind stack frames. The binary should have one of the following sections: .eh_frame, .debug_frame, .ARM.exidx or .gnu_debugdata.
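As a quick way to check the second condition, the sketch below scans a section listing for the section names given above. It assumes the listing comes from `readelf -S -W <binary>` (binutils or llvm-readelf), a tool this document does not otherwise use. The helper names `section_names` and `unwind_sections_present` are hypothetical.

```python
import re
from typing import List

# Section names listed in the text as carrying call frame information.
UNWIND_SECTIONS = (".eh_frame", ".debug_frame", ".ARM.exidx", ".gnu_debugdata")


def section_names(readelf_output: str) -> List[str]:
    """Extract section names from `readelf -S -W` style output (assumed format)."""
    # Each section row is assumed to look like: "  [ 3] .eh_frame  PROGBITS ..."
    return re.findall(r"^\s*\[\s*\d+\]\s+(\.\S+)", readelf_output, flags=re.MULTILINE)


def unwind_sections_present(readelf_output: str) -> List[str]:
    """Return which of the unwind-related sections appear in the listing."""
    names = set(section_names(readelf_output))
    return [s for s in UNWIND_SECTIONS if s in names]
```

An empty result suggests the binary has none of the listed sections, so DWARF based unwinding through it is likely to stop early.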
To mitigate these problems:
For the missing stack data problem:
To alleviate it, simpleperf joins callchains (call stacks) after recording. If two callchains of a thread have an entry containing the same ip and sp address, then simpleperf tries to join them to make the callchains longer. So we can get more complete callchains by recording longer and joining more samples. This doesn't guarantee to get complete call graphs. But it usually works well.
Simpleperf stores samples in a buffer before unwinding them. If the buffer is low on free space, simpleperf may decide to truncate stack data for a sample to 1K. Hopefully, this can be recovered by the callchain joiner. But when a high percentage of samples are truncated, many callchains can be broken. We can tell if many samples are truncated from the record command output, like:
$ simpleperf record ...
simpleperf I cmd_record.cpp:809] Samples recorded: 105584 (cut 86291). Samples lost: 6501.

$ simpleperf record ...
simpleperf I cmd_record.cpp:894] Samples recorded: 7,365 (1,857 with truncated stacks).
There are two ways to avoid truncating samples. One is increasing the buffer size, like --user-buffer-size 1G. But --user-buffer-size is only available in the latest simpleperf. If that option isn't available, we can use --no-cut-samples to disable truncating samples.
For the missing DWARF call frame info problem:
Most C++ code generates binaries containing call frame info, in .eh_frame or .ARM.exidx sections. These sections are not stripped, and are usually enough for stack unwinding.
For C code and a small percentage of C++ code that the compiler is sure will not generate exceptions, the call frame info is generated in the .debug_frame section. The .debug_frame section is usually stripped along with other debug sections. One way to fix this is to download unstripped binaries onto the device, as here.
The compiler doesn't generate unwind instructions for function prologues and epilogues, because they operate on stack frames and will not generate exceptions. But profiling may hit these instructions and fail to unwind them. This usually doesn't matter in a flame graph. But in a time based Stack Chart (like in Android Studio and Firefox profiler), this causes stack gaps once in a while. We can remove stack gaps via --remove-gaps, which is already enabled by default.
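The callchain joining idea described above (extending a short callchain when another callchain of the same thread contains an entry with the same ip and sp) can be illustrated with a minimal sketch. This is not simpleperf's actual implementation. The `Frame` type, the function name `join_callchains`, and the single-step extension strategy are assumptions made for illustration. Frames are ordered from innermost to outermost.

```python
from typing import Dict, List, Tuple

Frame = Tuple[int, int]  # (ip, sp), hypothetical representation of one callchain entry


def join_callchains(chains: List[List[Frame]]) -> List[List[Frame]]:
    """Extend each callchain of one thread using outer frames seen in other chains."""
    # For every (ip, sp) entry, remember the longest run of outer frames
    # that follows it in any recorded chain.
    best_tail: Dict[Frame, List[Frame]] = {}
    for chain in chains:
        for i, frame in enumerate(chain):
            tail = chain[i + 1:]
            if len(tail) > len(best_tail.get(frame, [])):
                best_tail[frame] = tail

    joined = []
    for chain in chains:
        result = list(chain)
        # If the outermost frame of this chain also appears deeper inside
        # another chain, append the frames that chain recorded beyond it.
        if result:
            result.extend(best_tail.get(result[-1], []))
        joined.append(result)
    return joined


# A truncated chain and a longer chain sharing the entry (0x200, 0x7040).
truncated = [(0x100, 0x7000), (0x200, 0x7040)]
longer = [(0x150, 0x6FF0), (0x200, 0x7040), (0x300, 0x7080), (0x400, 0x70C0)]
print(join_callchains([truncated, longer])[0])
```

Recording for longer gives the table more chances to contain a matching entry, which is why longer recordings tend to produce more complete callchains.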
When using DWARF based call graphs, simpleperf generates callchains during recording to save space. The debug information needed to unwind C functions is in the .debug_frame section, which is usually stripped from native libraries in APKs. To fix this, we can download unstripped versions of the native libraries onto the device, and ask simpleperf to use them when recording.
To use simpleperf directly:
# create native_libs dir on device, and push unstripped libs in it (nested dirs are not supported).
$ adb shell mkdir /data/local/tmp/native_libs
$ adb push <unstripped_dir>/*.so /data/local/tmp/native_libs
# run simpleperf record with --symfs option.
$ adb shell simpleperf record xxx --symfs /data/local/tmp/native_libs
To use app_profiler.py:
$ ./app_profiler.py -lib <unstripped_dir>
The simpleperf record command collects symbols on device in perf.data. But if the native libraries you use on device are stripped, this will result in a lot of unknown symbols in the report. A solution is to build binary_cache on host.
# Collect binaries needed by perf.data in binary_cache/.
$ ./binary_cache_builder.py -lib NATIVE_LIB_DIR,...
The NATIVE_LIB_DIRs passed in -lib option are the directories containing unstripped native libraries on host. After running it, the native libraries containing symbol tables are collected in binary_cache/ for use when reporting.
$ ./report.py --symfs binary_cache
# report_html.py searches binary_cache/ automatically, so you don't need to
# pass it any argument.
$ ./report_html.py
To show hot places at source code and instruction level, we need to show source code and disassembly with event count annotation. Simpleperf supports showing annotated source code and disassembly for C++ code and fully compiled Java code. Simpleperf supports two ways to do it:
Through report_html.py:
binary_cache_builder.py -i perf.data -lib <dir_of_lib_with_debug_info>
Through pprof:
pprof_proto_generator.py
When using simpleperf record, we may see lost samples or samples with truncated stack data. Before saving samples to a file, simpleperf uses two buffers to cache samples in memory. One is a kernel buffer, the other is a userspace buffer. The kernel puts samples into the kernel buffer. Simpleperf moves samples from the kernel buffer to the userspace buffer before processing them. If a buffer overflows, we lose samples or get samples with truncated stack data. Below is an example.
$ simpleperf record -a --duration 1 -g --user-buffer-size 100k
simpleperf I cmd_record.cpp:799] Recorded for 1.00814 seconds. Start post processing.
simpleperf I cmd_record.cpp:894] Samples recorded: 79 (16 with truncated stacks). Samples lost: 2,129 (kernelspace: 18, userspace: 2,111).
simpleperf W cmd_record.cpp:911] Lost 18.5567% of samples in kernel space, consider increasing kernel buffer size(-m), or decreasing sample frequency(-f), or increasing sample period(-c).
simpleperf W cmd_record.cpp:928] Lost/Truncated 97.1233% of samples in user space, consider increasing userspace buffer size(--user-buffer-size), or decreasing sample frequency(-f), or increasing sample period(-c).
In the above example, we get 79 samples, 16 of which have truncated stack data. We lose 18 samples in the kernel buffer, and 2,111 samples in the userspace buffer.
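To make the percentages in the warnings easier to relate to the sample counts, here is a small sketch that parses the summary line and recomputes them. The formulas are an assumption inferred from the numbers in this example (they reproduce 18.5567% and 97.1233%), not taken from the simpleperf source. The regular expression and the function names `parse_summary` and `loss_percentages` are hypothetical, and the regular expression only targets the newer output format shown in this example.

```python
import re

# Matches the "Samples recorded ... Samples lost ..." summary line shown above.
SUMMARY_RE = re.compile(
    r"Samples recorded: (?P<recorded>[\d,]+) \((?P<truncated>[\d,]+) with truncated stacks\)\. "
    r"Samples lost: (?P<lost>[\d,]+) \(kernelspace: (?P<kernel>[\d,]+), userspace: (?P<user>[\d,]+)\)\."
)


def parse_summary(line: str) -> dict:
    """Turn the summary line into integer counts (thousands separators removed)."""
    match = SUMMARY_RE.search(line)
    if match is None:
        raise ValueError("unrecognized summary line")
    return {key: int(value.replace(",", "")) for key, value in match.groupdict().items()}


def loss_percentages(counts: dict) -> tuple:
    """Recompute the two warning percentages (assumed formulas, see lead-in)."""
    # Kernel space: samples lost in the kernel buffer vs. lost plus recorded.
    kernel = 100.0 * counts["kernel"] / (counts["kernel"] + counts["recorded"])
    # User space: lost or truncated samples vs. lost plus recorded.
    user = 100.0 * (counts["user"] + counts["truncated"]) / (counts["user"] + counts["recorded"])
    return kernel, user


line = ("simpleperf I cmd_record.cpp:894] Samples recorded: 79 (16 with truncated stacks). "
        "Samples lost: 2,129 (kernelspace: 18, userspace: 2,111).")
kernel_pct, user_pct = loss_percentages(parse_summary(line))
print(f"kernel: {kernel_pct:.4f}%  user: {user_pct:.4f}%")
```

With the counts from the example this gives 18 / (18 + 79) for kernel space and (2,111 + 16) / (2,111 + 79) for user space.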
To reduce lost samples in the kernel buffer, we can increase the kernel buffer size via -m. To reduce lost samples in the userspace buffer, or reduce samples with truncated stack data, we can increase the userspace buffer size via --user-buffer-size.
We can also reduce the number of samples generated in a fixed time period, for example by reducing the sample frequency using -f, reducing the number of monitored threads, or not monitoring multiple perf events at the same time.
Bugs and feature requests can be submitted at https://github.com/android/ndk/issues. Patches can be uploaded to android-review.googlesource.com as here, or sent to email addresses listed here.
If you want to compile the simpleperf C++ source code, follow the steps below:
$ . build/envsetup.sh
$ lunch aosp_arm64-trunk_staging-userdebug
$ mmma system/extras/simpleperf -j30
If built successfully, out/target/product/generic_arm64/system/bin/simpleperf is for ARM64, and out/target/product/generic_arm64/system/bin/simpleperf32 is for ARM.
The source code of the simpleperf Python scripts is in system/extras/simpleperf/scripts. Most scripts rely on simpleperf binaries to work. To update binaries for scripts (using a Linux x86_64 host and an Android arm64 target as an example):
$ cp out/host/linux-x86/lib64/libsimpleperf_report.so system/extras/simpleperf/scripts/bin/linux/x86_64/libsimpleperf_report.so
$ cp out/target/product/generic_arm64/system/bin/simpleperf_ndk64 system/extras/simpleperf/scripts/bin/android/arm64/simpleperf
Then you can try the latest simpleperf scripts and binaries in system/extras/simpleperf/scripts.