| # Building and Running ExecuTorch with the Vulkan Backend |
| |
| The [ExecuTorch Vulkan Delegate](./native-delegates-executorch-vulkan-delegate.md) |
| is a native GPU delegate for ExecuTorch. |
| |
| <!-- This will show a grid card on the page -->
| ::::{grid} 2 |
| :::{grid-item-card} What you will learn in this tutorial: |
| :class-card: card-content |
| * How to export the Llama3.2-1B model with partial GPU delegation
| * How to execute the partially delegated model on Android |
| ::: |
| :::{grid-item-card} Prerequisites: |
| :class-card: card-prerequisites |
| * Follow [**Setting up ExecuTorch**](./getting-started-setup.md) |
| * It is also recommended that you read through [**ExecuTorch Vulkan Delegate**](./native-delegates-executorch-vulkan-delegate.md) and follow the example on that page
| ::: |
| :::: |
| |
| ## Prerequisites |
| |
| Note that all the steps below should be performed from the ExecuTorch
| repository root directory, and assume that you have gone through the steps of
| setting up ExecuTorch.
| |
| It is also assumed that the Android NDK and Android SDK are installed, and
| that the following environment variables are set.
| |
| ```shell |
| export ANDROID_NDK=<path_to_ndk> |
| # Select an appropriate Android ABI for your device |
| export ANDROID_ABI=arm64-v8a |
| # All subsequent commands should be performed from ExecuTorch repo root |
| cd <path_to_executorch_root> |
| # Make sure adb works |
| adb --version |
| ``` |
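| 
| Since the lowered model will execute on the GPU, it is also worth confirming
| that the target device actually supports Vulkan. One quick check, assuming the
| device advertises the standard Android feature flags, is to query the package
| manager over `adb`:
| 
| ```shell
| # List the Vulkan feature flags advertised by the device; expect entries such
| # as android.hardware.vulkan.level and android.hardware.vulkan.version
| adb shell pm list features | grep -i vulkan
| ```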
| |
| ## Lowering the Llama3.2-1B model to Vulkan |
| |
| ::::{note} |
| The resultant model will only be partially delegated to the Vulkan backend. In |
| particular, only binary arithmetic operators (`aten.add`, `aten.sub`, |
| `aten.mul`, `aten.div`), matrix multiplication operators (`aten.mm`, `aten.bmm`), |
| and linear layers (`aten.linear`) will be executed on the GPU via the Vulkan |
| delegate. The rest of the model will be executed using Portable operators. |
| |
| Operator support for LLaMA models is currently in active development; please |
| check out the `main` branch of the ExecuTorch repo for the latest capabilities. |
| :::: |
| |
| First, obtain the `consolidated.00.pth`, `params.json` and `tokenizer.model` |
| files for the `Llama3.2-1B` model from the [Llama website](https://www.llama.com/llama-downloads/). |
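| 
| Before exporting, it can help to verify that the downloaded files are where the
| export command below expects them. Assuming the default download location of
| `~/.llama`:
| 
| ```shell
| # Confirm that the checkpoint, params, and tokenizer files are all present
| ls ~/.llama/checkpoints/Llama3.2-1B/consolidated.00.pth \
|    ~/.llama/checkpoints/Llama3.2-1B/params.json \
|    ~/.llama/tokenizer.model
| ```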
| |
| Once the files have been downloaded, the `export_llama` script can be used to |
| partially lower the Llama model to Vulkan. |
| |
| ```shell |
| # The files will usually be downloaded to ~/.llama |
| python -m examples.models.llama.export_llama \ |
| --disable_dynamic_shape --vulkan -kv --use_sdpa_with_kv_cache -d fp32 \ |
| -c ~/.llama/checkpoints/Llama3.2-1B/consolidated.00.pth \ |
| -p ~/.llama/checkpoints/Llama3.2-1B/params.json \ |
| --metadata '{"get_bos_id":128000, "get_eos_ids":[128009, 128001]}' |
| ``` |
| |
| A `vulkan_llama2.pte` file should have been created as a result of running the |
| script. |
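| 
| As a quick sanity check, verify that the file exists and note its size; with
| `-d fp32` the export stores roughly four bytes per parameter, so expect a file
| of several gigabytes for a 1B parameter model:
| 
| ```shell
| # The .pte file should appear in the current working directory (the repo root)
| ls -lh vulkan_llama2.pte
| ```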
| |
| Push the `tokenizer.model` file and `vulkan_llama2.pte` onto your Android device:
| |
| ```shell |
| adb push ~/.llama/tokenizer.model /data/local/tmp/ |
| adb push vulkan_llama2.pte /data/local/tmp/ |
| ``` |
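| 
| You can confirm that both files made it onto the device before building the
| runner:
| 
| ```shell
| # Verify that both files are present on the device
| adb shell ls -l /data/local/tmp/tokenizer.model /data/local/tmp/vulkan_llama2.pte
| ```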
| |
| ## Build and Run the LLaMA runner binary on Android |
| |
| First, build and install ExecuTorch libraries, then build the LLaMA runner |
| binary using the Android NDK toolchain. |
| |
| ```shell |
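| # Build and install the core ExecuTorch libraries, with the Vulkan backend and
| # the quantized/custom kernel libraries enabled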
| (rm -rf cmake-android-out && \ |
| cmake . -DCMAKE_INSTALL_PREFIX=cmake-android-out \ |
| -DCMAKE_TOOLCHAIN_FILE=$ANDROID_NDK/build/cmake/android.toolchain.cmake \ |
| -DANDROID_ABI=$ANDROID_ABI \ |
| -DEXECUTORCH_BUILD_EXTENSION_DATA_LOADER=ON \ |
| -DEXECUTORCH_BUILD_EXTENSION_MODULE=ON \ |
| -DEXECUTORCH_BUILD_EXTENSION_TENSOR=ON \ |
| -DEXECUTORCH_BUILD_VULKAN=ON \ |
| -DEXECUTORCH_BUILD_KERNELS_QUANTIZED=ON \ |
| -DEXECUTORCH_BUILD_KERNELS_CUSTOM=ON \ |
| -DPYTHON_EXECUTABLE=python \ |
| -Bcmake-android-out && \ |
| cmake --build cmake-android-out -j16 --target install) |
| |
| # Build the LLaMA runner binary
| (rm -rf cmake-android-out/examples/models/llama && \ |
| cmake examples/models/llama \ |
| -DCMAKE_TOOLCHAIN_FILE=$ANDROID_NDK/build/cmake/android.toolchain.cmake \ |
| -DANDROID_ABI=$ANDROID_ABI \ |
| -DEXECUTORCH_BUILD_KERNELS_OPTIMIZED=ON \ |
| -DEXECUTORCH_BUILD_KERNELS_CUSTOM=ON \ |
| -DCMAKE_INSTALL_PREFIX=cmake-android-out \ |
| -DPYTHON_EXECUTABLE=python \ |
| -Bcmake-android-out/examples/models/llama && \ |
| cmake --build cmake-android-out/examples/models/llama -j16) |
| ``` |
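| 
| Before pushing the runner to the device, you can optionally confirm that it was
| cross-compiled for the ABI you selected; `file` should report an ELF binary
| matching `$ANDROID_ABI` (e.g. `aarch64` for `arm64-v8a`) rather than your host
| architecture:
| 
| ```shell
| # Check that llama_main was built for the target device, not the host
| file cmake-android-out/examples/models/llama/llama_main
| ```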
| |
| Finally, push and run the LLaMA runner binary on your Android device. Note that
| your device must have sufficient GPU memory to execute the model.
| |
| ```shell |
| adb push cmake-android-out/examples/models/llama/llama_main /data/local/tmp/llama_main |
| |
| adb shell /data/local/tmp/llama_main \ |
| --model_path=/data/local/tmp/vulkan_llama2.pte \ |
| --tokenizer_path=/data/local/tmp/tokenizer.model \ |
| --prompt="Hello"
| ``` |
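| 
| The runner accepts more flags than the ones shown above, and the exact set may
| change between releases. Since `llama_main` is a gflags-based binary, the flags
| supported by your build can be listed directly:
| 
| ```shell
| # Print the flags supported by this build of the runner
| adb shell /data/local/tmp/llama_main --help
| ```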
| |
| Note that model inference is currently very slow due to the large number of
| delegate blobs in the lowered graph, each of which requires data to be
| transferred to and from the GPU. Performance is expected to improve drastically
| as more of the model can be lowered to the Vulkan delegate, and as techniques
| such as quantization are supported.