| # String interning |
| |
| *Interned* strings are conceptually part of an interpreter-global |
| *set* of interned strings, meaning that: |
| - no two interned strings have the same content (across an interpreter); |
| - two interned strings can be safely compared using pointer equality |
| (Python `is`). |
| |
| This is used to optimize dict and attribute lookups, among other things. |
| |
| Python uses three different mechanisms to intern strings: |
| |
| - Singleton strings marked in C source with `_Py_STR` and `_Py_ID` macros. |
| These are statically allocated, and collected using `make regen-global-objects` |
| (`Tools/build/generate_global_objects.py`), which generates code |
| for declaration, initialization and finalization. |
| |
| The difference between the two kinds is not important. (A `_Py_ID` string is |
| a valid C name, with which we can refer to it; a `_Py_STR` may e.g. contain |
| non-identifier characters, so it needs a separate C-compatible name.) |
| |
| The empty string is in this category (as `_Py_STR(empty)`). |
| |
| These singletons are interned in a runtime-global lookup table, |
| `_PyRuntime.cached_objects.interned_strings` (`INTERNED_STRINGS`), |
| at runtime initialization. |
| |
| - The 256 possible one-character latin-1 strings are singletons, |
| which can be retrieved with `_Py_LATIN1_CHR(c)`, are stored in runtime-global |
| arrays, `_PyRuntime.static_objects.strings.ascii` and |
| `_PyRuntime.static_objects.strings.latin1`. |
| |
| These are NOT interned at startup in the normal build. |
| In the free-threaded build, they are; this avoids modifying the |
| global lookup table after threads are started. |
| |
| Interning a one-char latin-1 string will always intern the corresponding |
| singleton. |
| |
| - All other strings are allocated dynamically, and have their |
| `_PyUnicode_STATE(s).statically_allocated` flag set to zero. |
| When interned, such strings are added to an interpreter-wide dict, |
| `PyInterpreterState.cached_objects.interned_strings`. |
| |
| The key and value of each entry in this dict reference the same object. |
| |
| The three sets of singletons (`_Py_STR`, `_Py_ID`, `_Py_LATIN1_CHR`) |
| are disjoint. |
| If you have such a singleton, it (and no other copy) will be interned. |
| |
| |
| ## Immortality and reference counting |
| |
| Invariant: Every immortal string is interned, *except* the one-char latin-1 |
| singletons (which might but might not be interned). |
| |
| In practice, this means that you must not use `_Py_SetImmortal` on |
| a string. (If you know it's already immortal, don't immortalize it; |
| if you know it's not interned you might be immortalizing a redundant copy; |
| if it's interned and mortal it needs extra processing in |
| `_PyUnicode_InternImmortal`.) |
| |
| The converse is not true: interned strings can be mortal. |
| For mortal interned strings: |
| - the 2 references from the interned dict (key & value) are excluded from |
| their refcount |
| - the deallocator (`unicode_dealloc`) removes the string from the interned dict |
| - at shutdown, when the interned dict is cleared, the references are added back |
| |
| As with any type, you should only immortalize strings that will live until |
| interpreter shutdown. |
| We currently also immortalize strings contained in code objects and similar, |
| specifically in the compiler and in `marshal`. |
| These are “close enough” to immortal: even in use cases like hot reloading |
| or `eval`-ing user input, the number of distinct identifiers and string |
| constants expected to stay low. |
| |
| |
| ## Internal API |
| |
| We have the following *internal* API for interning: |
| |
| - `_PyUnicode_InternMortal`: just intern the string |
| - `_PyUnicode_InternImmortal`: intern, and immortalize the result |
| - `_PyUnicode_InternStatic`: intern a static singleton (`_Py_STR`, `_Py_ID` |
| or one-byte). Not for general use. |
| |
| All take an interpreter state, and a pointer to a `PyObject*` which they |
| modify in place. |
| |
| The functions take ownership of (“steal”) the reference to their argument, |
| and update the argument with a *new* reference. |
| This means: |
| - They're “reference neutral”. |
| - They must not be called with a borrowed reference. |
| |
| |
| ## State |
| |
| The intern state (retrieved by `PyUnicode_CHECK_INTERNED(s)`; |
| stored in `_PyUnicode_STATE(s).interned`) can be: |
| |
| - `SSTATE_NOT_INTERNED` (defined as 0, which is useful in a boolean context) |
| - `SSTATE_INTERNED_MORTAL` (1) |
| - `SSTATE_INTERNED_IMMORTAL` (2) |
| - `SSTATE_INTERNED_IMMORTAL_STATIC` (3) |
| |
| The valid transitions between these states are: |
| |
| - For dynamically allocated strings: |
| |
| - 0 -> 1 (`_PyUnicode_InternMortal`) |
| - 1 -> 2 or 0 -> 2 (`_PyUnicode_InternImmortal`) |
| |
| Using `_PyUnicode_InternStatic` on these is an error; the other cases |
| don't change the state. |
| |
| - One-char latin-1 singletons can be interned (0 -> 3) using any interning |
| function; after that the functions don't change the state. |
| |
| - Other statically allocated strings are interned (0 -> 3) at runtime init; |
| after that all interning functions don't change the state. |