| # App Code is Memory |
| |
| The code you write is itself a form of memory use. Every class, method, and |
| string constant in your application must be loaded into RAM when it is executed. |
| The larger your application's code base, the more memory it will consume just to |
| exist. |
| |
| ## File-Backed Memory and Demand Paging |
| |
| Android maps executable code into your process using `mmap`: the `.dex` and |
| `.so` files packaged in your `.apk`, plus the `.odex`/`.oat` files compiled from |
| it. This means the code is **file-backed**. |
| |
| Crucially, Android uses **demand paging**. When your app starts, the kernel does |
| not load the entire APK into RAM immediately. Instead, it only maps the file |
| into the process's virtual address space. As your app executes and the CPU |
| jumps to code on a page that is not yet resident, it triggers a "page fault." |
| The kernel pauses the thread, reads that specific page of code (typically 4KB) |
| from storage into physical RAM, and resumes execution. |
| |
|  |
| |
| <!-- |
| Source for the above diagram is located at: images/app-code/demand_paging.dot |
| To regenerate: `dot -Tpng images/app-code/demand_paging.dot -o images/app-code/demand_paging.png` |
| --> |
| |
| This means that code you *package* but never *execute* does not use physical |
| memory for the code pages themselves. However, unused libraries still increase |
| the overall APK size and can significantly increase the memory used by the |
| system's internal metadata (like DEX indices and class descriptors), which must |
| be read to even *know* that the code exists. Furthermore, many libraries contain |
| static initializers or are touched by dependency injection frameworks during app |
| startup, causing them to be paged into RAM anyway. |
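
The demand-paging behaviour described above can be imitated on a desktop JVM. The following is a minimal sketch, not Android's loader: it maps a scratch file with `FileChannel.map` (which is backed by `mmap` on Linux), reads a few bytes, and reports which pages those reads fall on. The class name, the scratch file, and the fixed 4096-byte page size are assumptions made for illustration.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.MappedByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;
import java.util.TreeSet;

final class DemandPagingSketch {
    static final int PAGE_SIZE = 4096; // assumption: 4KB pages

    /** Creates a zero-filled scratch file of pageCount pages and maps it read-only. */
    static MappedByteBuffer mapScratchFile(int pageCount) {
        try {
            Path file = Files.createTempFile("fake-code", ".bin");
            file.toFile().deleteOnExit();
            Files.write(file, new byte[pageCount * PAGE_SIZE]);
            try (FileChannel channel = FileChannel.open(file, StandardOpenOption.READ)) {
                // Mapping reserves address space only; the mapping outlives the channel.
                return channel.map(FileChannel.MapMode.READ_ONLY, 0, channel.size());
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    /** Reads one byte at each offset and returns the distinct page indices touched. */
    static TreeSet<Integer> touch(MappedByteBuffer mapping, int... offsets) {
        TreeSet<Integer> pages = new TreeSet<>();
        for (int offset : offsets) {
            // The first access to a page that is not yet mapped into the process
            // faults (a minor fault if the data is already in the page cache).
            mapping.get(offset);
            pages.add(offset / PAGE_SIZE);
        }
        return pages;
    }

    public static void main(String[] args) {
        MappedByteBuffer mapping = mapScratchFile(64);
        TreeSet<Integer> pages = touch(mapping, 0, 100, 5 * PAGE_SIZE, 40 * PAGE_SIZE + 7);
        System.out.println("pages mapped: 64, pages touched: " + pages);
    }
}
```

Four reads land on only three of the 64 mapped pages, so only those three pages ever need to be resident on behalf of this mapping.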
| |
| ### Page Eviction and Slowdowns |
| |
| Because file-backed memory can always be re-read from storage, the kernel considers |
| these pages "clean". When the system experiences memory pressure, the kernel |
| will **evict** (drop) these clean code pages from RAM to make room for other |
| things. |
| |
| If your app later needs to execute that code again, the CPU will fault, and the |
| kernel must re-read the page from storage. **The more code your app has, the more |
| vulnerable it is to having its code evicted.** When a user returns to your |
| bloated app after using other apps, they will experience random jank and |
| slowdowns as the CPU constantly stalls waiting for code to be paged back in from |
| storage. |
| |
| **The Cost of a Page Fault:** While it varies heavily based on the device's |
| storage speed (UFS vs. eMMC) and kernel state, a major page fault (reading 4KB |
| from storage) can cost anywhere from **0.5ms to 5ms**. If your startup path touches |
| 500 different pages of unoptimized code and each one is faulted in separately, |
| that adds roughly 250ms to 2.5s of pure I/O latency to your app startup time. |
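
The arithmetic behind that estimate is simple enough to check directly. This sketch assumes the simplest possible model, in which every touched page costs exactly one major fault and nothing is prefetched; the class and method names are made up for illustration.

```java
final class PageFaultCostEstimate {
    /** Total stall in milliseconds if every touched page costs one major fault. */
    static double totalStallMs(int pagesTouched, double msPerMajorFault) {
        return pagesTouched * msPerMajorFault;
    }

    public static void main(String[] args) {
        int pages = 500; // pages touched on the startup path (figure from the text)
        System.out.printf("fast storage: %.0f ms%n", totalStallMs(pages, 0.5));
        System.out.printf("slow storage: %.0f ms%n", totalStallMs(pages, 5.0));
    }
}
```

With the figures from the text this gives 250 ms at 0.5 ms per fault and 2500 ms at 5 ms per fault. Real devices will land somewhere else, since the kernel may read several neighbouring pages per fault.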
| |
| ## Exploring Code Size with Compiler Explorer |
| |
| To build an intuition for how your Java or Kotlin code translates into native |
| machine code (and thus memory bytes), you can use |
| [Compiler Explorer](https://godbolt.org/). |
| |
| Android support is built directly into Godbolt. It allows you to see how |
| different parts of the Android toolchain (D8, R8, and dex2oat) transform your |
| source code. |
| |
| ### How to use Compiler Explorer with Android |
| |
| 1. Navigate to [godbolt.org](https://godbolt.org/). |
| 2. Select **Android Java** or **Android Kotlin** from the language dropdown |
| (top left). |
| 3. In the compiler dropdown (top right of the code pane), you can choose |
| between different tools: |
| * **`d8`**: Shows the **Dalvik bytecode** (`.dex`). This is the closest |
| representation to your original code and is easier to read. |
| * **`r8`**: Shows how the **R8 optimizer** shrinks and optimizes your |
| bytecode. |
| * **`dex2oat`**: Shows the final **ARM64 machine code** that actually |
| executes on the device. This is where you can see the real memory impact |
| (4 bytes per instruction). `dex2oat` can target different ISAs, but |
| ARM64 is the most common for mobile phones. |
| 4. **Source<>Output Highlighting**: Hovering over a line of code will highlight |
| the corresponding bytecode or machine code instructions, making it easy to |
| trace the impact of specific statements. |
| 5. **Optimization Pipeline**: In the disassembly view, you can click **Add |
| new...** -> **Opt Pipeline**. This allows you to see the internal steps the |
| compiler takes. You can inspect how the **Intermediate Representation (IR)** is |
| transformed at each stage (e.g., between the "Inliner (before)" and "Inliner |
| (after)" steps) before it is lowered into the final ARM64 machine code. |
| |
|  |
| |
| ### Why this matters for Memory |
| |
| Every instruction you see in the `dex2oat` output targeting the ARM64 ISA takes |
| up **4 bytes** in your app's executable (`.odex` or `.oat`) file. |
| |
| Try entering code that utilizes different language features and study the |
| compiler’s output: |
| |
| * **Array access vs. List Iterators**: |
| * A simple **array loop** over `int[]` might compile to ~10 instructions |
| (~40 bytes). |
| * A **foreach loop over a `List`** implicitly uses an **`Iterator`**. This |
| can result in 30-40 instructions (~120-160 bytes) because of the extra |
| method calls (`hasNext()`, `next()`) and the allocation of the iterator |
| object itself. |
| * **R8 Optimization**: Under the right conditions (e.g., when the `List` |
| is proven to be an `ArrayList`), the **R8 optimizer** can transform a |
| foreach loop back into a simple indexed loop, eliminating the iterator |
| overhead and reducing both code size and runtime memory churn. |
| * **Virtual Method Calls**: Involve loading the object's class, finding the |
| method in the `vtable`, and then branching. This usually takes 4-5 |
| instructions (~16-20 bytes). |
| * **Direct/Static Calls**: Often translate to a single `bl` (Branch with Link) |
| instruction (4 bytes). |
| * **Kotlin Lambdas**: Can generate entire anonymous classes and additional |
| bridge methods, adding hundreds of bytes of code and metadata overhead for a |
| simple functional block. |
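
As a starting point for the array-versus-`List` comparison in the list above, here is a small sketch you could paste into Compiler Explorer. The class and method names are made up for illustration, and the instruction counts you observe will depend on the compiler version and flags you select.

```java
import java.util.Arrays;
import java.util.List;

final class LoopShapes {
    // Indexed loop over a primitive array: no iterator object, no boxing.
    static int sumArray(int[] values) {
        int total = 0;
        for (int i = 0; i < values.length; i++) {
            total += values[i];
        }
        return total;
    }

    // For-each over a List: desugars to iterator(), hasNext() and next(),
    // plus an unboxing call for every element.
    static int sumList(List<Integer> values) {
        int total = 0;
        for (int value : values) {
            total += value;
        }
        return total;
    }

    public static void main(String[] args) {
        System.out.println(sumArray(new int[] {1, 2, 3}));
        System.out.println(sumList(Arrays.asList(1, 2, 3)));
    }
}
```

Both methods compute the same result, so any difference in the compiled output comes from the loop shape and not from the work being done.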
| |
| By using Compiler Explorer, you can see how sophisticated language features |
| (like Kotlin lambdas, stream APIs, or heavy use of generics) impact the final |
| compiled size of your application, and how optimizers such as R8 can counteract |
| the cost of language abstractions in some cases. This tool can help you make |
| informed tradeoffs in designing and implementing an app. |
| |
| Broadly speaking, more complexity in your app's code leads to higher memory use. |
| Conversely, simpler code, or code that is simplified by R8, results in a |
| smaller representation as CPU instructions and bytes in storage and RAM. |
| |
| ## Measuring Code Impact with `meminfo` and `showmap` |
| |
| You can use the standard Android memory tools to see how much memory your app's |
| code is consuming. |
| |
| ### `dumpsys meminfo` |
| |
| When you run `adb shell dumpsys meminfo <package>`, the **Code** category in the |
| **App Summary** section provides a high-level view of code-related memory: |
| |
| ```none |
| App Summary |
| Pss(KB) |
| ------ |
| Java Heap: 3244 |
| Native Heap: 5412 |
| Code: 24512 # <--- Sum of .so, .dex, .oat, .art, etc. |
| ``` |
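
When comparing several builds, it can help to pull the **Code** row out of captured output programmatically. The sketch below is one possible way to do that on a workstation, assuming the row keeps the `Code:` label followed by the PSS value in KB as in the excerpt above; the class name and the regular expression are illustrative and not part of any Android tool.

```java
import java.util.OptionalLong;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class MeminfoCodeRow {
    // Matches a line such as "           Code:    24512" and captures the first number.
    private static final Pattern CODE_ROW =
            Pattern.compile("(?m)^[ \\t]*Code:[ \\t]+(\\d+)");

    /** Returns the Code PSS in KB, or empty if the output has no Code row. */
    static OptionalLong codePssKb(String meminfoOutput) {
        Matcher matcher = CODE_ROW.matcher(meminfoOutput);
        return matcher.find()
                ? OptionalLong.of(Long.parseLong(matcher.group(1)))
                : OptionalLong.empty();
    }

    public static void main(String[] args) {
        // Sample text modelled on the App Summary excerpt in this section.
        String sample = String.join("\n",
                " App Summary",
                "                       Pss(KB)",
                "                        ------",
                "           Java Heap:     3244",
                "         Native Heap:     5412",
                "                Code:    24512");
        System.out.println("Code PSS (KB): " + codePssKb(sample).getAsLong());
    }
}
```

Feeding it the saved output of `dumpsys meminfo` for two builds gives two numbers that can be compared directly.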
| |
| ### `showmap` |
| |
| For a more granular view, use `showmap`. It lists every memory mapping in the |
| process, including which files are mapped and how much of each is resident: |
| |
| ```bash |
| # Quote the command so that $(pidof ...) is expanded on the device, not on the host. |
| adb shell 'showmap $(pidof <package>)' | grep -E "\.oat|\.odex|\.dex|\.apk" |
| ``` |
| |
| You will see entries for your application's compiled code: |
| |
| ```none |
| size RSS PSS clean dirty clean dirty swap swapPSS object |
| ------- -------- -------- -------- -------- -------- -------- -------- -------- ---------------- |
| 12288 8192 8192 8192 0 0 0 0 0 /data/app/.../base.odex |
| ``` |
| |
| ## Dead Code and R8 |
| |
| Because every executed method takes up memory, having a "bloated" app with |
| unnecessary initializations or unused libraries can severely impact startup |
| performance and baseline memory usage. |
| |
| This is why tools like [R8](https://r8.googlesource.com/r8) (the successor to |
| ProGuard) are critical. R8 analyzes your application's bytecode and removes any |
| classes or methods that are never called ("dead code stripping"). |
| |
| ### Hands-on Exercise: The Cost of Bloat |
| |
| To demonstrate the impact of code size, we have created two versions of an |
| application in the `samples/CodeBloat/` directory. |
| |
| The build script `generate_code.sh` artificially creates 300 Java classes, each |
| with 500 methods containing unique strings. |
| |
| - **CodeBloat**: The standard, unoptimized build containing all generated |
| classes. |
| - **CodeBloatOptimized**: The same source code, but compiled with R8 shrinking |
| enabled. |
| |
| The `MainActivity` in both apps attempts to touch all 300 classes on a |
| background thread during startup. |
| |
| #### 1. Build and Install |
| |
| ```bash |
| m CodeBloat CodeBloatOptimized |
| adb install -r $OUT/system/app/CodeBloat/CodeBloat.apk |
| adb install -r $OUT/system/app/CodeBloatOptimized/CodeBloatOptimized.apk |
| ``` |
| |
| #### 2. Speed Compile |
| |
| To maximize the file-backed memory impact, we will use the `cmd package compile` |
| tool to ahead-of-time (AOT) compile the apps into `.oat` files. |
| |
| ```bash |
| adb shell cmd package compile -m speed -f com.android.codebloat |
| adb shell cmd package compile -m speed -f com.android.codebloat.optimized |
| ``` |
| |
| Please note that this is a synthetic example. Typically, apps will use the |
| `speed-profile` compilation mode (see further below). |
| |
| #### 3. Launch and Compare |
| |
| Launch the unoptimized app and check its memory footprint: |
| |
| ```bash |
| adb shell am start -W -n com.android.codebloat/.MainActivity |
| sleep 5 # Wait for the background thread to load classes |
| adb shell dumpsys meminfo -s com.android.codebloat |
| ``` |
| |
| Now do the same for the optimized app: |
| |
| ```bash |
| adb shell am start -W -n com.android.codebloat.optimized/com.android.codebloat.MainActivity |
| sleep 5 |
| adb shell dumpsys meminfo -s com.android.codebloat.optimized |
| ``` |
| |
| **The Results:** |
| |
| If you look at the **Code** row in the `App Summary` section, you will see a |
| massive difference: |
| |
| - **Unoptimized `Code`**: ~30,000 KB (30 MB) |
| - **Optimized `Code`**: ~2,000 KB (2 MB) |
| |
| Because R8 determined that the 500 methods inside those classes were never |
| actually doing anything useful (the `doSomething()` method only calls |
| `method0()`, and the results are ignored), it stripped almost all of the |
| artificially generated code out of the final APK. |
| |
| #### 4. View Startup in Perfetto |
| |
| The impact of this code bloat is extremely visible during application startup. |
| |
| **Unoptimized App (`mem.rss.file` climbs massively):** |
| |
|  |
| |
| In the unoptimized app, the `mem.rss.file` track (representing file-backed |
| memory) increases dramatically during the application startup phase. As the app |
| touches the bloated, artificially generated classes, the operating system is |
| forced to page in large amounts of code from the compiled `.oat` file on storage. |
| You can visually see this impact in the thread state track for the main thread: |
| the high frequency of **yellow slices** indicates the thread is frequently |
| blocked and stalling on file I/O while waiting for these code pages to be read. |
| The bottom panel shows a massive delta value, adding up to over 141MB of |
| file-backed memory paged into RAM. This heavy I/O causes the startup to take |
| over 1.2 seconds, resulting in a noticeably sluggish user experience. |
| |
| Note: the trace screenshots demonstrate a memory trend, but actual magnitudes |
| will vary by device characteristics. |
| |
| **Optimized App (`mem.rss.file` peaks at a lower value):** |
| |
|  |
| |
| <!-- TODO retake screenshots, showing the breakdown of thread state time, and |
| zooming on classloading slices. --> |
| |
| In the optimized app, R8 has stripped the dead code out of the APK during the |
| build process, leaving fewer executable pages to read from storage. The |
| `mem.rss.file` track climbs less (a delta of ~114MB instead of ~141MB, about 19% |
| lower), and the total startup time drops by roughly 40%, to about 743ms. This |
| reduces I/O stalls and leaves more free memory for the rest of the system. |
| |
| **Startup Comparison:** |
| |
| Metric | Unoptimized (CodeBloat) | Optimized (CodeBloatOptimized) |
| :----------------------- | :---------------------- | :----------------------------- |
| **Startup Time** | ~1.25 seconds | ~743 ms |
| **`mem.rss.file` Delta** | ~141 MB | ~114 MB |
| |
| #### PerfettoSQL for File-Backed Memory |
| |
| You can run a query to track the maximum amount of file-backed memory that any |
| `codebloat` application touched during its execution: |
| |
| ```sql |
| SELECT |
| p.name AS process_name, |
| max(c.value)/1024.0/1024.0 AS max_rss_file_mb |
| FROM counter c |
| JOIN process_counter_track t ON c.track_id = t.id |
| JOIN process p USING (upid) |
| WHERE p.name LIKE 'com.android.codebloat%' AND t.name = 'mem.rss.file' |
| GROUP BY p.name; |
| ``` |
| |
| ## ART Compilation Modes and Memory |
| |
| The Android Runtime (ART) can compile your application code in one of several |
| different modes, also known as |
| [compiler filters](https://source.android.com/docs/core/runtime/configure#compiler_filters). |
| The selected compiler filter has a direct impact on your app's memory footprint. |
| |
| * **`verify`**: ART only performs bytecode verification. No AOT compilation is |
| performed. Code is executed via the **Interpreter** or compiled at runtime |
| by the **JIT** compiler. |
| * **Memory Impact**: Smallest on-disk size. Native code memory usage is |
| pushed into the `JIT Cache` (anonymous dirty memory). |
| * **`speed`**: ART performs full AOT compilation of all methods. |
| * **Memory Impact**: Largest `.odex` size. Maximizes **file-backed (clean) |
| memory** usage. |
| * **`speed-profile`**: ART only compiles methods that have been marked as |
| "hot" in a **startup profile**. |
| * **Memory Impact**: Balanced approach. Only the most critical code is AOT |
| compiled. |
| |
| The most common filter is `speed-profile`, which is used when installing user |
| apps. This is configured in the system properties `pm.dexopt.install` and |
| `pm.dexopt.bg-dexopt`, and is typically set in |
| `build/make/target/product/runtime_libart.mk`. |
| |
| Some system apps will use `speed` compilation, and will also compile at system |
| image build time. `verify` is typically only used in development use cases. |
| |
| Use case | Typical compiler filter |
| ------------ | ----------------------- |
| Development | `verify` |
| System image | `speed` |
| User apps | `speed-profile` |
| |
| ### Hands-on Exercise: Compilation Modes and Memory |
| |
| We can use the `CodeBloat` app to see how these filters affect memory. To |
| reproduce these measurements: |
| |
| 1. Force re-compile the app into the target mode. |
| 2. Force-stop and cold-start the app. |
| 3. Wait for the background thread to finish touching classes (watch logcat or |
| wait 5s). |
| 4. Run `adb shell dumpsys meminfo com.android.codebloat`. |
| |
| *Note: These measurements were taken on a Pixel 10 Pro Fold. Actual numbers will |
| vary by device; these are for scale.* |
| |
| **Mode: `verify` (No AOT)** |
| |
| ```bash |
| adb shell cmd package compile -m verify -f com.android.codebloat |
| adb shell am force-stop com.android.codebloat |
| adb shell am start -W -n com.android.codebloat/.MainActivity |
| sleep 5 |
| adb shell dumpsys meminfo com.android.codebloat |
| ``` |
| |
| In `verify` mode, the App Summary shows: |
| |
| * **Code PSS**: ~8,000 KB |
| * **Dalvik Other (JIT)**: ~25,000 KB |
| |
| Because no code is compiled AOT, the runtime must JIT-compile hot methods into |
| the **JIT Cache**, which shows up as **dirty anonymous memory** (`Dalvik |
| Other`). |
| |
| **Mode: `speed` (Full AOT)** |
| |
| ```bash |
| adb shell cmd package compile -m speed -f com.android.codebloat |
| adb shell am force-stop com.android.codebloat |
| adb shell am start -W -n com.android.codebloat/.MainActivity |
| sleep 5 |
| adb shell dumpsys meminfo com.android.codebloat |
| ``` |
| |
| In `speed` mode, the results shift dramatically: |
| |
| * **Code PSS**: ~24,000 KB |
| * **Dalvik Other (JIT)**: ~5,000 KB |
| |
| The application's code is now mapped from the `.odex` file as **clean |
| file-backed memory**. This reduces pressure on the JIT cache and makes the |
| memory eligible for eviction under pressure, rather than being "stuck" as dirty |
| RAM. |
| |
| **Mode: `speed-profile` (Selective AOT)** |
| |
| Modern apps may bundle a Baseline Profile (`baseline.prof`). ART uses this to |
| selectively compile only the code needed for a fast, memory-efficient startup. |
| |
| In this exercise we will generate a profile on-device that records the classes |
| and methods the app uses during startup. In practice, the compiler may also |
| receive profiles from external sources such as the application store ("cloud |
| profiles"), which can provide crowdsourced profiles for an app whether or not |
| the developer bundled a Baseline Profile of their own. |
| |
| #### Generating and Using On-device Profiles |
| |
| To see the impact of `speed-profile`, you can generate your own profile |
| on-device: |
| |
| 1. **Reset**: Stop any running instance of the app. |
| |
| ```bash |
| adb shell am force-stop com.android.codebloat |
| ``` |
| |
| 2. **Interact**: Start the app and let it run its startup sequence. |
| |
| 3. **Dump Profile**: |
| |
| ```bash |
| adb shell 'kill -s SIGUSR1 $(pidof com.android.codebloat)' |
| ``` |
| |
| (This forces the app to write its current profile to disk). |
| |
| 4. **Install Profile**: |
| |
| ```bash |
| adb shell cp /data/misc/profiles/cur/0/com.android.codebloat/primary.prof \ |
| /data/misc/profiles/ref/com.android.codebloat/primary.prof |
| ``` |
| |
| 5. **Compile**: |
| |
| ```bash |
| adb shell cmd package compile -m speed-profile -f com.android.codebloat |
| ``` |
| |
| When you launch again, you'll see a balance: **Code PSS** will be lower than |
| `speed` (e.g., ~16,000 KB) because only the "hot" startup methods were compiled, |
| leaving the rest to be handled by the interpreter or JIT only if they are ever |
| actually used. |
| |
| See: |
| |
| * [Baseline Profiles overview](https://developer.android.com/topic/performance/baselineprofiles/overview) |
| * [Difference between Baseline Profiles and Startup Profiles](https://developer.android.com/topic/performance/baselineprofiles/difference-baseline-startup) |
| |
| ## Deep Dive into Compiled Code |
| |
| If you want to see exactly what instructions ART is generating, refer to the |
| [Disassembly Guide](../../../art/DISASSEMBLY_GUIDE.md). |
| |
| It provides detailed instructions on using: |
| |
| * **`oatdump`**: To see ARM64 instructions inside an existing `.odex` file. |
| * **`dex2oat`**: To simulate compilation with verbose debug flags. |
| |
| ### Exercise: Code Inlining |
| |
| One reason compiled code can grow unexpectedly is **method inlining**. The |
| compiler may decide to copy the body of a small, frequently called method |
| directly into its callers. |
| |
| In our `CodeBloat` app, the `doSomething()` method in every generated class |
| simply calls `method0()`. When compiled in `speed` mode, ART's Optimizing |
| compiler will likely inline `method0()` into `doSomething()`. |
| |
| **Exercise:** Verify this using `oatdump` on your device: |
| |
| ```bash |
| # 1. Find the path to the application's APK and compiled .odex file |
| adb shell pm path com.android.codebloat |
| # Output: package:/data/app/~~.../base.apk |
| |
| adb shell "dumpsys package com.android.codebloat | grep 'location is' | head -n 1" |
| # Example output: [location is /data/app/~~.../oat/arm64/base.odex] |
| |
| # 2. Run oatdump (substituting the correct path to base.odex) |
| adb shell oatdump --oat-file=/data/app/~~.../oat/arm64/base.odex \ |
| --class-filter=com.android.codebloat.GeneratedClass0 |
| ``` |
| |
| Look for the `doSomething` method in the output. If it was inlined, you will see |
| the instructions to load the long string constant directly within `doSomething`, |
| rather than a `bl` instruction targeting `method0`. |
| |
| #### Visualizing the Optimization (CFG) |
| |
| To see exactly when the compiler decided to inline the method, you can produce a |
| **Control Flow Graph (CFG)**. This shows the state of the code at every stage of |
| the optimization pipeline, with every transformation over the compiler's |
| Intermediate Representation (IR) until the code is lowered to the target ISA |
| (e.g. ARM64). |
| |
| 1. **Run `dex2oat` with dump flags**: Use the `--verbose-methods` flag to limit |
| the output to specific methods; otherwise, the `.cfg` file for a large app |
| can grow to several gigabytes. |
| |
| ```bash |
| # Substitution of actual paths required: |
| adb shell dex2oat64 --dex-file=/data/app/~~.../base.apk \ |
| --oat-file=/data/local/tmp/dump.odex \ |
| --compiler-filter=speed \ |
| --dump-cfg=/data/local/tmp/codebloat.cfg \ |
| --verbose-methods=doSomething |
| ``` |
| |
| 2. **Pull and View**: Pull the `.cfg` file to your workstation and open it with |
| [IR Hydra](https://mrale.ph/irhydra/1/). |
| |
| 3. **Find the Inliner**: In IR Hydra, load the compilation artifacts and search |
| for `doSomething`. Compare the representation before and after the |
| **Inliner** pass. You will see the graph expand as the instructions from |
| `method0` are merged into the caller. |
| |
| Alternatively, use the **Opt Pipeline** tool in **Compiler Explorer** (as |
| described in the section above) and enter similar code to see a similar |
| transformation performed at the **Inliner** pass. |
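
If you do not have the generated sources at hand, a stand-in with the same shape is enough to watch the Inliner at work. The class below is hypothetical: the text only says that `doSomething()` calls `method0()` and ignores the result, so the `static` modifiers, the string contents, and the class name are assumptions.

```java
final class GeneratedClassSketch {
    // Stand-in for one generated method: returns a unique string constant.
    static String method0() {
        return "GeneratedClassSketch.method0: a long, unique string constant";
    }

    // Calls method0() and ignores the result, as the text describes.
    static void doSomething() {
        method0();
    }

    public static void main(String[] args) {
        doSomething();
        System.out.println(method0());
    }
}
```

Because the result is unused, the compiler may go further than inlining and remove the call entirely, so what you see may differ between compiler versions.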
| |
| ### Exercise: Volatile Fields and Memory Barriers |
| |
| In the `MemoryLab` app, the `mGarbageSink` field is marked as `volatile`. This |
| ensures that the compiler doesn't optimize away our garbage allocations. |
| |
| ```java |
| public volatile byte[] mGarbageSink; |
| ``` |
| |
| In the ARM64 disassembly, you will see that every access to this field either |
| uses **Load-Acquire/Store-Release** instructions (`ldar`/`stlr`) or is |
| accompanied by a **Memory Barrier** (`dmb ish`). This ensures thread |
| visibility but adds a few extra instructions to every access, increasing the |
| code size slightly compared to a regular field. |
| |
| **Exercise:** Find the field accesses and the associated memory barriers in the |
| disassembly. |
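
To make the difference easier to spot, it helps to compile a volatile store next to an ordinary one. This is a minimal sketch with made-up names, not code taken from `MemoryLab`; paste it into Compiler Explorer or disassemble it with `oatdump` and compare the two store methods.

```java
final class VolatileSinkSketch {
    volatile byte[] volatileSink; // stores to this field must be visible to other threads
    byte[] plainSink;             // ordinary field with no ordering guarantees

    // Expect a store-release or a barrier around this store.
    void storeVolatile(int size) {
        volatileSink = new byte[size];
    }

    // Expect a plain store instruction here.
    void storePlain(int size) {
        plainSink = new byte[size];
    }

    public static void main(String[] args) {
        VolatileSinkSketch sketch = new VolatileSinkSketch();
        sketch.storeVolatile(16);
        sketch.storePlain(16);
        System.out.println(sketch.volatileSink.length + " " + sketch.plainSink.length);
    }
}
```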
| |
| ### Exercise: Implicit Suspend Checks |
| |
| If you disassemble a loop, like the one in `generateAllocationChurn`, you will |
| notice a curious instruction at the end of the loop body: |
| |
| ```asm |
| ldr x21, [x21] |
| ``` |
| |
| This is an **Implicit Suspend Check**. ART uses this to allow the Garbage |
| Collector to pause threads safely. Register `x21` normally points to itself. |
| When the GC needs to suspend the thread, it "poisons" that memory location. The |
| next time the thread executes that `ldr`, it will trigger a fault, which the |
| runtime catches and uses to transition the thread into a suspended state. |
| |
| This pattern is repeated in every loop and at the start of every method, |
| contributing to the total code size of your application. |
| |
| **Exercise:** Find all the implicit suspend checks in the method disassembly, |
| and try to correlate them to the original source code. |
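
A method with a single loop is the easiest place to start. The sketch below is a hypothetical stand-in for `generateAllocationChurn`, whose source is not shown here; it has one method entry and one loop back edge, which are the two places where the text says suspend checks appear.

```java
final class SuspendCheckLoopSketch {
    // Fills the buffer with a simple pattern. The store in the loop body keeps
    // the loop from being folded into a constant by the compiler.
    static byte[] fill(int length) {
        byte[] buffer = new byte[length];
        for (int i = 0; i < length; i++) {
            buffer[i] = (byte) i;
        }
        return buffer;
    }

    public static void main(String[] args) {
        System.out.println(fill(8).length);
    }
}
```

The compiler may still unroll or vectorize a loop like this, in which case the number and position of the checks will not map one-to-one onto the source.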
| |
| ________________________________________________________________________________ |
| |
| **Next: [Threads and Memory](threads.md)** |