fix eval llama (#4469)

Summary:
Pull Request resolved: https://github.com/pytorch/executorch/pull/4469

A previous refactor moved files from `examples/...` to `extension/...`, but llama eval was not covered by CI, so the move broke it. Fix it here.
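The failing import in the log below ("cannot import name 'LLMEdgeManager'") is the classic symptom of a symbol no longer being re-exported from a package's `__init__.py` after a file move. As a hedged illustration only (the package and module names `mypkg`/`builder` are hypothetical stand-ins, not the actual executorch layout), the throwaway package below shows the re-export pattern that resolves this class of error:

```python
import os
import sys
import tempfile

# Build a throwaway package on disk to illustrate the re-export pattern;
# `mypkg` and `builder` are hypothetical, not real executorch modules.
pkg_root = tempfile.mkdtemp()
pkg_dir = os.path.join(pkg_root, "mypkg")
os.makedirs(pkg_dir)

# The submodule that actually defines the class.
with open(os.path.join(pkg_dir, "builder.py"), "w") as f:
    f.write("class LLMEdgeManager:\n    pass\n")

# Without this re-export line in __init__.py,
# `from mypkg import LLMEdgeManager` raises
# ImportError: cannot import name 'LLMEdgeManager' from 'mypkg'.
with open(os.path.join(pkg_dir, "__init__.py"), "w") as f:
    f.write("from .builder import LLMEdgeManager\n")

sys.path.insert(0, pkg_root)
from mypkg import LLMEdgeManager  # now resolves at the package level
```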

before:
```
(executorch) chenlai@chenlai-mbp executorch % python -m examples.models.llama2.eval_llama -c /Users/chenlai/Documents/stories110M/stories110M/stories110M.pt  -p /Users/chenlai/Documents/stories110M/stories110M/params.json  -t /Users/chenlai/Documents/stories110M/stories110M/tokenizer.model  -d fp32 --max_seq_len 127 --limit 5
/opt/homebrew/anaconda3/envs/executorch/lib/python3.10/site-packages/bitsandbytes/cextension.py:34: UserWarning: The installed version of bitsandbytes was compiled without GPU support. 8-bit optimizers, 8-bit multiplication, and GPU quantization are unavailable.
  warn("The installed version of bitsandbytes was compiled without GPU support. "
'NoneType' object has no attribute 'cadam32bit_grad_fp32'
/opt/homebrew/anaconda3/envs/executorch/lib/python3.10/site-packages/executorch/exir/passes/_quant_patterns_and_replacements.py:106: FutureWarning: `torch.library.impl_abstract` was renamed to `torch.library.register_fake`. Please use that instead; we will remove `torch.library.impl_abstract` in a future version of PyTorch.
  impl_abstract("quantized_decomposed::embedding_byte.out")
/opt/homebrew/anaconda3/envs/executorch/lib/python3.10/site-packages/executorch/exir/passes/_quant_patterns_and_replacements.py:153: FutureWarning: `torch.library.impl_abstract` was renamed to `torch.library.register_fake`. Please use that instead; we will remove `torch.library.impl_abstract` in a future version of PyTorch.
  impl_abstract("quantized_decomposed::embedding_byte.dtype_out")
/opt/homebrew/anaconda3/envs/executorch/lib/python3.10/site-packages/executorch/exir/passes/_quant_patterns_and_replacements.py:228: FutureWarning: `torch.library.impl_abstract` was renamed to `torch.library.register_fake`. Please use that instead; we will remove `torch.library.impl_abstract` in a future version of PyTorch.
  impl_abstract("quantized_decomposed::embedding_4bit.out")
/opt/homebrew/anaconda3/envs/executorch/lib/python3.10/site-packages/executorch/exir/passes/_quant_patterns_and_replacements.py:281: FutureWarning: `torch.library.impl_abstract` was renamed to `torch.library.register_fake`. Please use that instead; we will remove `torch.library.impl_abstract` in a future version of PyTorch.
  impl_abstract("quantized_decomposed::embedding_4bit.dtype_out")
Traceback (most recent call last):
  File "/opt/homebrew/anaconda3/envs/executorch/lib/python3.10/runpy.py", line 196, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/opt/homebrew/anaconda3/envs/executorch/lib/python3.10/runpy.py", line 86, in _run_code
    exec(code, run_globals)
  File "/Users/chenlai/executorch/examples/models/llama2/eval_llama.py", line 13, in <module>
    from .eval_llama_lib import build_args_parser, eval_llama
  File "/Users/chenlai/executorch/examples/models/llama2/eval_llama_lib.py", line 19, in <module>
    from executorch.extension.llm.export import LLMEdgeManager
ImportError: cannot import name 'LLMEdgeManager' from 'executorch.extension.llm.export' (/opt/homebrew/anaconda3/envs/executorch/lib/python3.10/site-packages/executorch/extension/llm/export/__init__.py)
(executorch) chenlai@chenlai-mbp executorch %
(executorch) chenlai@chenlai-mbp executorch %
```
after:

```
(executorch) chenlai@chenlai-mbp executorch % python -m examples.models.llama2.eval_llama -c /Users/chenlai/Documents/stories110M/stories110M/stories110M.pt  -p /Users/chenlai/Documents/stories110M/stories110M/params.json  -t /Users/chenlai/Documents/stories110M/stories110M/tokenizer.model  -d fp32 --max_seq_len 127 --limit 5
/opt/homebrew/anaconda3/envs/executorch/lib/python3.10/site-packages/bitsandbytes/cextension.py:34: UserWarning: The installed version of bitsandbytes was compiled without GPU support. 8-bit optimizers, 8-bit multiplication, and GPU quantization are unavailable.
  warn("The installed version of bitsandbytes was compiled without GPU support. "
'NoneType' object has no attribute 'cadam32bit_grad_fp32'
/opt/homebrew/anaconda3/envs/executorch/lib/python3.10/site-packages/executorch/exir/passes/_quant_patterns_and_replacements.py:106: FutureWarning: `torch.library.impl_abstract` was renamed to `torch.library.register_fake`. Please use that instead; we will remove `torch.library.impl_abstract` in a future version of PyTorch.
  impl_abstract("quantized_decomposed::embedding_byte.out")
/opt/homebrew/anaconda3/envs/executorch/lib/python3.10/site-packages/executorch/exir/passes/_quant_patterns_and_replacements.py:153: FutureWarning: `torch.library.impl_abstract` was renamed to `torch.library.register_fake`. Please use that instead; we will remove `torch.library.impl_abstract` in a future version of PyTorch.
  impl_abstract("quantized_decomposed::embedding_byte.dtype_out")
/opt/homebrew/anaconda3/envs/executorch/lib/python3.10/site-packages/executorch/exir/passes/_quant_patterns_and_replacements.py:228: FutureWarning: `torch.library.impl_abstract` was renamed to `torch.library.register_fake`. Please use that instead; we will remove `torch.library.impl_abstract` in a future version of PyTorch.
  impl_abstract("quantized_decomposed::embedding_4bit.out")
/opt/homebrew/anaconda3/envs/executorch/lib/python3.10/site-packages/executorch/exir/passes/_quant_patterns_and_replacements.py:281: FutureWarning: `torch.library.impl_abstract` was renamed to `torch.library.register_fake`. Please use that instead; we will remove `torch.library.impl_abstract` in a future version of PyTorch.
  impl_abstract("quantized_decomposed::embedding_4bit.dtype_out")
2024-07-30:12:36:04,260 INFO     [tokenizer.py:33] #words: 32000 - BOS ID: 1 - EOS ID: 2
2024-07-30:12:36:04,260 INFO     [export_llama_lib.py:419] Applying quantizers: []
2024-07-30:12:36:04,260 INFO     [export_llama_lib.py:594] Loading model with checkpoint=/Users/chenlai/Documents/stories110M/stories110M/stories110M.pt, params=/Users/chenlai/Documents/stories110M/stories110M/params.json, use_kv_cache=False, weight_type=WeightType.LLAMA
/Users/chenlai/executorch/examples/models/llama2/model.py:99: FutureWarning: You are using `torch.load` with `weights_only=False` (the current default value), which uses the default pickle module implicitly. It is possible to construct malicious pickle data which will execute arbitrary code during unpickling (See https://github.com/pytorch/pytorch/blob/main/SECURITY.md#untrusted-models for more details). In a future release, the default value for `weights_only` will be flipped to `True`. This limits the functions that could be executed during unpickling. Arbitrary objects will no longer be allowed to be loaded via this mode unless they are explicitly allowlisted by the user via `torch.serialization.add_safe_globals`. We recommend you start setting `weights_only=True` for any use case where you don't have full control of the loaded file. Please open an issue on GitHub for any issues related to this experimental feature.
  checkpoint = torch.load(checkpoint_path, map_location=device, mmap=True)
2024-07-30:12:36:04,315 INFO     [export_llama_lib.py:616] Loaded model with dtype=torch.float32
2024-07-30:12:36:04,395 INFO     [huggingface.py:162] Using device 'cpu'
2024-07-30:12:36:27,262 WARNING  [task.py:763] [Task: wikitext] metric word_perplexity is defined, but aggregation is not. using default aggregation=weighted_perplexity
2024-07-30:12:36:27,262 WARNING  [task.py:775] [Task: wikitext] metric word_perplexity is defined, but higher_is_better is not. using default higher_is_better=False
2024-07-30:12:36:27,262 WARNING  [task.py:763] [Task: wikitext] metric byte_perplexity is defined, but aggregation is not. using default aggregation=weighted_perplexity
2024-07-30:12:36:27,262 WARNING  [task.py:775] [Task: wikitext] metric byte_perplexity is defined, but higher_is_better is not. using default higher_is_better=False
2024-07-30:12:36:27,262 WARNING  [task.py:763] [Task: wikitext] metric bits_per_byte is defined, but aggregation is not. using default aggregation=bits_per_byte
2024-07-30:12:36:27,262 WARNING  [task.py:775] [Task: wikitext] metric bits_per_byte is defined, but higher_is_better is not. using default higher_is_better=False
Repo card metadata block was not found. Setting CardData to empty.
2024-07-30:12:36:29,494 WARNING  [repocard.py:107] Repo card metadata block was not found. Setting CardData to empty.
2024-07-30:12:36:30,401 INFO     [task.py:395] Building contexts for wikitext on rank 0...
100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 5/5 [00:00<00:00, 718.57it/s]
2024-07-30:12:36:30,410 INFO     [evaluator.py:362] Running loglikelihood_rolling requests
100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 5/5 [00:14<00:00,  2.91s/it]
wikitext: {'word_perplexity,none': 10885.215324239069, 'word_perplexity_stderr,none': 'N/A', 'byte_perplexity,none': 6.144013518032613, 'byte_perplexity_stderr,none': 'N/A', 'bits_per_byte,none': 2.6191813902741017, 'bits_per_byte_stderr,none': 'N/A', 'alias': 'wikitext'}
```
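As a quick sanity check on the wikitext numbers above: `bits_per_byte` should be the base-2 logarithm of `byte_perplexity`, since both are derived from the same per-byte negative log-likelihood (perplexity = 2 ** bits_per_byte). The values from the log are consistent:

```python
import math

# Values copied from the wikitext results in the log above.
byte_perplexity = 6.144013518032613
reported_bits_per_byte = 2.6191813902741017

# bits_per_byte = log2(byte_perplexity)
computed = math.log2(byte_perplexity)
assert abs(computed - reported_bits_per_byte) < 1e-4
```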
ghstack-source-id: 235865354
exported-using-ghexport

Reviewed By: larryliu0820

Differential Revision: D60466386

fbshipit-source-id: 0032af8b3269f107469fe142382dfacb06751808
1 file changed
README.md

ExecuTorch

ExecuTorch is an end-to-end solution for enabling on-device inference capabilities across mobile and edge devices including wearables, embedded devices and microcontrollers. It is part of the PyTorch Edge ecosystem and enables efficient deployment of PyTorch models to edge devices.

Key value propositions of ExecuTorch are:

  • Portability: Compatibility with a wide variety of computing platforms, from high-end mobile phones to highly constrained embedded systems and microcontrollers.
  • Productivity: Enabling developers to use the same toolchains and SDK from PyTorch model authoring and conversion, to debugging and deployment to a wide variety of platforms.
  • Performance: Providing end users with a seamless and high-performance experience due to a lightweight runtime and utilizing full hardware capabilities such as CPUs, NPUs, and DSPs.

For a comprehensive technical overview of ExecuTorch and step-by-step tutorials, please visit our documentation website for the latest release (or the main branch).

Check out the Getting Started page for a quick spin.

Feedback

We welcome any feedback, suggestions, and bug reports from the community to help us improve our technology. Please use the PyTorch Forums for discussion and feedback about ExecuTorch using the ExecuTorch category, and our GitHub repository for bug reporting.

We recommend using the latest release tag from the Releases page when developing.

Contributing

See CONTRIBUTING.md for details about issues, PRs, code style, CI jobs, and other development topics.

Directory Structure

executorch
├── backends                        #  Backend delegate implementations.
├── build                           #  Utilities for managing the build system.
├── codegen                         #  Tooling to autogenerate bindings between kernels and the runtime.
├── configurations
├── docs                            #  Static docs tooling.
├── examples                        #  Examples of various user flows, such as model export, delegates, and runtime execution.
├── exir                            #  Ahead-of-time library: model capture and lowering APIs.
|   ├── _serialize                  #  Serialize final export artifact.
|   ├── backend                     #  Backend delegate ahead-of-time APIs.
|   ├── capture                     #  Program capture.
|   ├── dialects                    #  Op sets for various dialects in the export process.
|   ├── emit                        #  Conversion from ExportedProgram to ExecuTorch execution instructions.
|   ├── operator                    #  Operator node manipulation utilities.
|   ├── passes                      #  Built-in compiler passes.
|   ├── program                     #  Export artifacts.
|   ├── serde                       #  Graph module serialization/deserialization.
|   ├── verification                #  IR verification.
├── extension                       #  Extensions built on top of the runtime.
|   ├── android                     #  ExecuTorch wrappers for Android apps.
|   ├── apple                       #  ExecuTorch wrappers for iOS apps.
|   ├── aten_util                   #  Converts to and from PyTorch ATen types.
|   ├── data_loader                 #  1st party data loader implementations.
|   ├── evalue_util                 #  Helpers for working with EValue objects.
|   ├── gguf_util                   #  Tools to convert from the GGUF format.
|   ├── kernel_util                 #  Helpers for registering kernels.
|   ├── memory_allocator            #  1st party memory allocator implementations.
|   ├── module                      #  A simplified C++ wrapper for the runtime.
|   ├── parallel                    #  C++ threadpool integration.
|   ├── pybindings                  #  Python API for executorch runtime.
|   ├── pytree                      #  C++ and Python flattening and unflattening lib for pytrees.
|   ├── runner_util                 #  Helpers for writing C++ PTE-execution tools.
|   ├── testing_util                #  Helpers for writing C++ tests.
|   ├── training                    #  Experimental libraries for on-device training.
├── kernels                         #  1st party kernel implementations.
|   ├── aten
|   ├── optimized
|   ├── portable                    #  Reference implementations of ATen operators.
|   ├── prim_ops                    #  Special ops used in executorch runtime for control flow and symbolic primitives.
|   ├── quantized
├── profiler                        #  Utilities for profiling runtime execution.
├── runtime                         #  Core C++ runtime.
|   ├── backend                     #  Backend delegate runtime APIs.
|   ├── core                        #  Core structures used across all levels of the runtime.
|   ├── executor                    #  Model loading, initialization, and execution.
|   ├── kernel                      #  Kernel registration and management.
|   ├── platform                    #  Layer between architecture specific code and portable C++.
├── schema                          #  ExecuTorch PTE file format flatbuffer schemas.
├── scripts                         #  Utility scripts for size management, dependency management, etc.
├── sdk                             #  Model profiling, debugging, and introspection.
├── shim                            #  Compatibility layer between OSS and Internal builds.
├── test                            #  Broad scoped end-to-end tests.
├── third-party                     #  Third-party dependencies.
├── util                            #  Various helpers and scripts.

License

ExecuTorch is BSD licensed, as found in the LICENSE file.