ExecuTorch is PyTorch’s unified solution for deploying AI models on-device—from smartphones to microcontrollers—built for privacy, performance, and portability. It powers Meta’s on-device AI across Instagram, WhatsApp, Quest 3, Ray-Ban Meta Smart Glasses, and more.
Deploy LLMs, vision, speech, and multimodal models with the same PyTorch APIs you already know—accelerating research to production with seamless model export, optimization, and deployment. No manual C++ rewrites. No format conversions. No vendor lock-in.
ExecuTorch uses ahead-of-time (AOT) compilation to prepare PyTorch models for edge deployment:
1. Export the model graph with `torch.export()`
2. Compile it to a portable `.pte` binary
3. Execute the `.pte` on-device via a lightweight C++ runtime

Models use a standardized Core ATen operator set. Partitioners delegate subgraphs to specialized hardware (NPU/GPU) with CPU fallback.
Learn more: How ExecuTorch Works • Architecture Guide
```bash
pip install executorch
```
For platform-specific setup (Android, iOS, embedded systems), see the Quick Start documentation.
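If you want to confirm the install before exporting anything, a quick, hypothetical sanity check (standard library only) is:

```python
# Hypothetical post-install check: import the package and print the installed
# version reported by pip metadata. Raises ImportError if the install failed.
import importlib.metadata

import executorch  # noqa: F401

print(importlib.metadata.version("executorch"))
```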
```python
import torch
from executorch.exir import to_edge_transform_and_lower
from executorch.backends.xnnpack.partition.xnnpack_partitioner import XnnpackPartitioner

# 1. Export your PyTorch model
model = MyModel().eval()
example_inputs = (torch.randn(1, 3, 224, 224),)
exported_program = torch.export.export(model, example_inputs)

# 2. Optimize for target hardware (switch backends with one line)
program = to_edge_transform_and_lower(
    exported_program,
    partitioner=[XnnpackPartitioner()]  # CPU | CoreMLPartitioner() for iOS | QnnPartitioner() for Qualcomm
).to_executorch()

# 3. Save for deployment
with open("model.pte", "wb") as f:
    f.write(program.buffer)

# Test locally via ExecuTorch runtime's pybind API (optional)
from executorch.runtime import Runtime

runtime = Runtime.get()
method = runtime.load_program("model.pte").load_method("forward")
outputs = method.execute([torch.randn(1, 3, 224, 224)])
```
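Before shipping the `.pte`, it can be worth checking that the lowered program still matches eager PyTorch. A minimal sketch, reusing `model`, `example_inputs`, and `method` from the snippet above (the tolerances are illustrative, not prescribed by ExecuTorch):

```python
import torch

# Run the same input through eager PyTorch and the ExecuTorch runtime,
# then compare the first output tensor within a loose tolerance.
with torch.no_grad():
    eager_out = model(*example_inputs)

et_out = method.execute([example_inputs[0]])[0]
torch.testing.assert_close(et_out, eager_out, rtol=1e-3, atol=1e-3)
print("ExecuTorch output matches eager PyTorch within tolerance")
```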
```cpp
#include <executorch/extension/module/module.h>
#include <executorch/extension/tensor/tensor.h>

Module module("model.pte");
auto tensor = make_tensor_ptr({2, 2}, {1.0f, 2.0f, 3.0f, 4.0f});
auto outputs = module.forward(tensor);
```
```swift
import ExecuTorch

let module = Module(filePath: "model.pte")
let input = Tensor<Float>([1.0, 2.0, 3.0, 4.0], shape: [2, 2])
let outputs = try module.forward(input)
```
```kotlin
val module = Module.load("model.pte")
val inputTensor = Tensor.fromBlob(floatArrayOf(1.0f, 2.0f, 3.0f, 4.0f), longArrayOf(2, 2))
val outputs = module.forward(EValue.from(inputTensor))
```
Export Llama models using the export_llm script or Optimum-ExecuTorch:
```bash
# Using export_llm
python -m executorch.extension.llm.export.export_llm --model llama3_2 --output llama.pte

# Using Optimum-ExecuTorch
optimum-cli export executorch \
  --model meta-llama/Llama-3.2-1B \
  --task text-generation \
  --recipe xnnpack \
  --output_dir llama_model
```
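Optimum-ExecuTorch also exposes a Python API. A hedged sketch based on its documentation (class and argument names may differ between releases, so treat this as an assumption to verify against the Optimum-ExecuTorch docs):

```python
# Assumed Optimum-ExecuTorch Python API: export and run a Hugging Face model.
from optimum.executorch import ExecuTorchModelForCausalLM
from transformers import AutoTokenizer

model_id = "meta-llama/Llama-3.2-1B"
model = ExecuTorchModelForCausalLM.from_pretrained(model_id, recipe="xnnpack")
tokenizer = AutoTokenizer.from_pretrained(model_id)

print(model.text_generation(
    tokenizer=tokenizer,
    prompt="Hello, how are you?",
    max_seq_len=128,
))
```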
Run on-device with the LLM runner API:
```cpp
#include <executorch/extension/llm/runner/text_llm_runner.h>

auto runner = create_llama_runner("llama.pte", "tiktoken.bin");
executorch::extension::llm::GenerationConfig config{
    .seq_len = 128, .temperature = 0.8f};
runner->generate("Hello, how are you?", config);
```
```swift
import ExecuTorchLLM

let runner = TextRunner(modelPath: "llama.pte", tokenizerPath: "tiktoken.bin")
try runner.generate("Hello, how are you?", Config {
  $0.sequenceLength = 128
}) { token in
  print(token, terminator: "")
}
```
Kotlin (Android) — API Docs • Demo App
```kotlin
val llmModule = LlmModule("llama.pte", "tiktoken.bin", 0.8f)
llmModule.load()
llmModule.generate("Hello, how are you?", 128, object : LlmCallback {
    override fun onResult(result: String) { print(result) }
    override fun onStats(stats: String) { }
})
```
For multimodal models (vision, audio), use the MultiModal runner API which extends the LLM runner to handle image and audio inputs alongside text. See Llava and Voxtral examples.
See examples/models/llama for complete workflow including quantization, mobile deployment, and advanced options.
Next Steps:
| Platform | Supported Backends |
|---|---|
| Android | XNNPACK, Vulkan, Qualcomm, MediaTek, Samsung Exynos |
| iOS | XNNPACK, MPS, CoreML (Neural Engine) |
| Linux / Windows | XNNPACK, OpenVINO, CUDA (experimental) |
| macOS | XNNPACK, MPS, Metal (experimental) |
| Embedded / MCU | XNNPACK, ARM Ethos-U, NXP, Cadence DSP |
See Backend Documentation for detailed hardware requirements and optimization guides. For desktop/laptop GPU inference with CUDA and Metal, see the Desktop Guide. For Zephyr RTOS integration, see the Zephyr Guide.
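To illustrate the table above: switching the target backend is just a different partitioner passed to `to_edge_transform_and_lower`. A sketch for Core ML (assumes a macOS host with the Core ML dependencies installed; the import path follows recent ExecuTorch releases and may move):

```python
# Lower the same exported program to Core ML instead of XNNPACK.
from executorch.backends.apple.coreml.partition import CoreMLPartitioner
from executorch.exir import to_edge_transform_and_lower

program = to_edge_transform_and_lower(
    exported_program,                   # from the export example above
    partitioner=[CoreMLPartitioner()],  # unsupported ops fall back to CPU
).to_executorch()
```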
ExecuTorch powers on-device AI at scale across Meta's family of apps, VR/AR devices, and partner deployments. View success stories →
LLMs: Llama 3.2/3.1/3, Qwen 3, Phi-4-mini, LiquidAI LFM2
Multimodal: Llava (vision-language), Voxtral (audio-language), Gemma (vision-language)
Vision/Speech: MobileNetV2, DeepLabV3, Whisper
Resources: examples/ directory • executorch-examples out-of-tree demos • Optimum-ExecuTorch for HuggingFace models • Unsloth for fine-tuned LLM deployment
ExecuTorch also provides advanced capabilities for production deployment.
See Advanced Topics for quantization techniques, custom backends, and compiler passes.
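As one concrete example of those techniques, post-training quantization for the XNNPACK backend uses the PT2E flow. A hedged sketch; the import locations have moved between releases (older PyTorch ships `prepare_pt2e`/`convert_pt2e` under `torch.ao.quantization.quantize_pt2e`), so verify against the quantization docs for your version:

```python
import torch
from executorch.backends.xnnpack.quantizer.xnnpack_quantizer import (
    XNNPACKQuantizer,
    get_symmetric_quantization_config,
)
from torchao.quantization.pt2e.quantize_pt2e import convert_pt2e, prepare_pt2e

model = MyModel().eval()  # hypothetical model, as in the export example
example_inputs = (torch.randn(1, 3, 224, 224),)

# Capture a quantization-friendly graph, insert observers, calibrate, convert.
captured = torch.export.export_for_training(model, example_inputs).module()
quantizer = XNNPACKQuantizer().set_global(get_symmetric_quantization_config())
prepared = prepare_pt2e(captured, quantizer)
prepared(*example_inputs)  # calibration pass; use representative data in practice
quantized = convert_pt2e(prepared)

# `quantized` can then be exported and lowered exactly as in the earlier example.
```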
We welcome contributions from the community!
ExecuTorch is BSD licensed, as found in the LICENSE file.