This example demonstrates how to export a Llama 2 model with ExecuTorch so that it can run in a mobile environment. For details on Llama 2 itself, please refer to Llama's GitHub page. Pretrained parameters are not included in this repo; users are encouraged to download them through Llama's download page.
Llama is a family of large language models trained on publicly available data. These models are based on the transformer architecture, which allows them to process input sequences of arbitrary length and generate output sequences of variable length. One of the key features of Llama models is their ability to generate coherent and contextually relevant text. This is achieved through the use of attention mechanisms, which allow the model to focus on different parts of the input sequence as it generates output. Additionally, Llama models are pre-trained with a language-modeling objective on a large corpus of text, which helps them learn to predict the next word in a sequence.
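To make the attention idea above concrete, here is a minimal, illustrative sketch of scaled dot-product attention in plain Python. This is not ExecuTorch or Llama code, and the tiny vectors are made up for demonstration; real models operate on large batched tensors with learned projections.

```python
import math

def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def attention(queries, keys, values):
    """Scaled dot-product attention: each query attends over all keys,
    and the output is the attention-weighted sum of the values."""
    d_k = len(keys[0])
    outputs = []
    for q in queries:
        # Similarity of this query to every key, scaled by sqrt(d_k).
        scores = [dot(q, k) / math.sqrt(d_k) for k in keys]
        weights = softmax(scores)  # non-negative, sums to 1
        # Weighted combination of the value vectors.
        out = [sum(w * v[i] for w, v in zip(weights, values))
               for i in range(len(values[0]))]
        outputs.append(out)
    return outputs

# Toy example: 2 queries attending over 3 key/value pairs.
Q = [[1.0, 0.0], [0.0, 1.0]]
K = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
V = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
print(attention(Q, K, V))
```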
Llama models have been shown to perform well on a variety of natural language processing tasks, including language translation, question answering, and text summarization. They are also capable of generating human-like text, making them a useful tool for creative writing and other applications where natural language generation is important.
Overall, Llama models are powerful and versatile language models that can be used for a wide range of natural language processing tasks. Their ability to generate coherent and contextually relevant text makes them particularly useful for applications such as chatbots, virtual assistants, and language translation.
Please note that the models are subject to the acceptable use policy and the provided responsible use guide.
This example tries to reuse the Python code, with modifications to make it compatible with current ExecuTorch:

1. Install the llama package: `cd examples/third-party/llama`, then run `pip install -e .`
2. From the `executorch` root, run `bash examples/models/llama2/install_requirements.sh`.
3. From the `executorch` root, run `python3 -m examples.models.llama2.export_llama`. The exported program, `llama2.pte`, will be saved in the current directory using the dummy checkpoint. Use `python3 -m examples.models.llama2.export_llama --checkpoint <checkpoint.pth> --params <params.json>` for custom checkpoints.
4. Download `stories110M.pt` and `tokenizer.model` from GitHub:
wget "https://huggingface.co/karpathy/tinyllamas/resolve/main/stories110M.pt"
wget "https://raw.githubusercontent.com/karpathy/llama2.c/master/tokenizer.model"
Create params file.
echo '{"dim": 768, "multiple_of": 32, "n_heads": 12, "n_layers": 12, "norm_eps": 1e-05, "vocab_size": 32000}' > params.json
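Before exporting, the generated `params.json` can be sanity-checked in Python. The expected field names below are taken from the `echo` command above; the validation helper itself is an illustrative sketch, not part of the ExecuTorch export tooling.

```python
import json

# Write the same params.json produced by the echo command above.
PARAMS = ('{"dim": 768, "multiple_of": 32, "n_heads": 12, '
          '"n_layers": 12, "norm_eps": 1e-05, "vocab_size": 32000}')
with open("params.json", "w") as f:
    f.write(PARAMS)

# Fields used in the params.json created above.
EXPECTED_KEYS = {"dim", "multiple_of", "n_heads",
                 "n_layers", "norm_eps", "vocab_size"}

def validate_params(path):
    """Load params.json and check it has the expected model fields."""
    with open(path) as f:
        params = json.load(f)
    missing = EXPECTED_KEYS - params.keys()
    if missing:
        raise ValueError(f"params.json is missing fields: {sorted(missing)}")
    # The hidden dimension must split evenly across attention heads.
    if params["dim"] % params["n_heads"] != 0:
        raise ValueError("dim must be divisible by n_heads")
    return params

params = validate_params("params.json")
print(params["dim"], params["n_heads"])
```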
Export the model. Export options are available here.
python3 -m examples.models.llama2.export_llama -c stories110M.pt -p params.json
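When scripting several exports, the command above can also be driven from a small Python helper. This is only a convenience sketch around the documented CLI; the helper names and defaults are our own, not part of ExecuTorch.

```python
import subprocess
import sys

def build_export_cmd(checkpoint="stories110M.pt", params="params.json"):
    """Assemble the export command shown above as an argv list."""
    return [
        sys.executable, "-m", "examples.models.llama2.export_llama",
        "-c", checkpoint,  # model checkpoint (.pt / .pth)
        "-p", params,      # model hyperparameters (params.json)
    ]

def export(checkpoint="stories110M.pt", params="params.json"):
    """Run the export; must be invoked from the executorch repo root.
    Produces llama2.pte in the current directory."""
    subprocess.run(build_export_cmd(checkpoint, params), check=True)

print(build_export_cmd())
```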
Create tokenizer.bin.
Build with buck2:
buck2 run examples/models/llama2/tokenizer:tokenizer_py -- -t tokenizer.model -o tokenizer.bin
Build with cmake: todo
Run the model. Run options are available here. Build with buck2:
buck2 run examples/models/llama2:main -- --model_path=llama2.pte --tokenizer_path=tokenizer.bin --prompt="Once"
Build with cmake: todo
See test script here.