| //! Support for tracking the last time files were used to assist with cleaning |
| //! up those files if they haven't been used in a while. |
| //! |
| //! Tracking of cache files is stored in a sqlite database which contains a |
| //! timestamp of the last time the file was used, as well as the size of the |
| //! file. |
| //! |
| //! While cargo is running, when it detects a use of a cache file, it adds a |
| //! timestamp to [`DeferredGlobalLastUse`]. This batches up a set of changes |
| //! that are then flushed to the database all at once (via |
| //! [`DeferredGlobalLastUse::save`]). Ideally saving would only be done once |
| //! for performance reasons, but that is not really possible due to the way |
| //! cargo works, since there are different ways cargo can be used (like `cargo |
| //! generate-lockfile`, `cargo fetch`, and `cargo build` are all very |
| //! different ways the code is used). |
| //! |
| //! All of the database interaction is done through the [`GlobalCacheTracker`] |
| //! type. |
| //! |
| //! There is a single global [`GlobalCacheTracker`] and |
| //! [`DeferredGlobalLastUse`] stored in [`GlobalContext`]. |
| //! |
| //! The high-level interface for performing garbage collection is defined in |
| //! the [`crate::core::gc`] module. The functions there are responsible for |
| //! interacting with the [`GlobalCacheTracker`] to handle cleaning of global |
| //! cache data. |
| //! |
| //! ## Automatic gc |
| //! |
| //! Some commands (primarily the build commands) will trigger an automatic |
| //! deletion of files that haven't been used in a while. The high-level |
| //! interface for this is the [`crate::core::gc::auto_gc`] function. |
| //! |
| //! The [`GlobalCacheTracker`] database tracks the last time an automatic gc |
| //! was performed so that it is only done once per day for performance |
| //! reasons. |
| //! |
| //! ## Manual gc |
| //! |
| //! The user can perform a manual garbage collection with the `cargo clean` |
| //! command. That command has a variety of options to specify what to delete. |
| //! Manual gc supports deleting based on age or size or both. From a |
| //! high-level, this is done by the [`crate::core::gc::Gc::gc`] method, which |
| //! calls into [`GlobalCacheTracker`] to handle all the cleaning. |
| //! |
| //! ## Locking |
| //! |
| //! Usage of the database requires that the package cache is locked to prevent |
| //! concurrent access. Although sqlite has built-in locking support, we want |
| //! to use cargo's locking so that the "Blocking" message gets displayed, and |
| //! so that locks can block indefinitely for long-running build commands. |
| //! [`rusqlite`] has a default timeout of 5 seconds, though that is |
| //! configurable. |
| //! |
| //! When garbage collection is being performed, the package cache lock must be |
| //! in [`CacheLockMode::MutateExclusive`] to ensure no other cargo process is |
| //! running. See [`crate::util::cache_lock`] for more detail on locking. |
| //! |
| //! When performing automatic gc, [`crate::core::gc::auto_gc`] will skip the |
| //! GC if the package cache lock is already held by anything else. Automatic |
| //! GC is intended to be opportunistic, and should impose as little disruption |
| //! to the user as possible. |
| //! |
| //! ## Compatibility |
| //! |
| //! The database must retain both forwards and backwards compatibility between |
| //! different versions of cargo. For the most part, this shouldn't be too |
| //! difficult to maintain. Generally sqlite doesn't change on-disk formats |
| //! between versions (the introduction of WAL is one of the few examples where |
| //! version 3 had a format change, but we wouldn't use it anyway since it has |
| //! shared-memory requirements cargo can't depend on due to things like |
| //! network mounts). |
| //! |
| //! Schema changes must be managed through [`migrations`] by adding new |
| //! entries that make a change to the database. Changes must not break older |
| //! versions of cargo. Generally, adding columns should be fine (either with a |
| //! default value, or NULL). Adding tables should also be fine. Just don't do |
| //! destructive things like removing a column, or changing the semantics of an |
| //! existing column. |
| //! |
| //! Since users may run older versions of cargo that do not do cache tracking, |
| //! the [`GlobalCacheTracker::sync_db_with_files`] method helps dealing with |
| //! keeping the database in sync in the presence of older versions of cargo |
| //! touching the cache directories. |
| //! |
| //! ## Performance |
| //! |
| //! A lot of focus on the design of this system is to minimize the performance |
| //! impact. Every build command needs to save updates which we try to avoid |
| //! having a noticeable impact on build times. Systems like Windows, |
| //! particularly with a magnetic hard disk, can experience a fairly large |
| //! impact of cargo's overhead. Cargo's benchsuite has some benchmarks to help |
| //! compare different environments, or changes to the code here. Please try to |
| //! keep performance in mind if making any major changes. |
| //! |
| //! Performance of `cargo clean` is not quite as important since it is not |
| //! expected to be run often. However, it is still courteous to the user to |
| //! try to not impact it too much. One part that has a performance concern is |
| //! that the clean command will synchronize the database with whatever is on |
| //! disk if needed (in case files were added by older versions of cargo that |
| //! don't do cache tracking, or if the user manually deleted some files). This |
| //! can potentially be very slow, especially if the two are very out of sync. |
| //! |
| //! ## Filesystems |
| //! |
| //! Everything here is sensitive to the kind of filesystem it is running on. |
| //! People tend to run cargo in all sorts of strange environments that have |
| //! limited capabilities, or on things like read-only mounts. The code here |
| //! needs to gracefully handle as many situations as possible. |
| //! |
| //! See also the information in the [Performance](#performance) and |
| //! [Locking](#locking) sections when considering different filesystems and |
| //! their impact on performance and locking. |
| //! |
| //! There are checks for read-only filesystems, which is generally ignored. |
| |
| use crate::core::gc::GcOpts; |
| use crate::core::Verbosity; |
| use crate::ops::CleanContext; |
| use crate::util::cache_lock::CacheLockMode; |
| use crate::util::interning::InternedString; |
| use crate::util::sqlite::{self, basic_migration, Migration}; |
| use crate::util::{Filesystem, Progress, ProgressStyle}; |
| use crate::{CargoResult, GlobalContext}; |
| use anyhow::{bail, Context as _}; |
| use cargo_util::paths; |
| use rusqlite::{params, Connection, ErrorCode}; |
| use std::collections::{hash_map, HashMap}; |
| use std::path::{Path, PathBuf}; |
| use std::time::{Duration, SystemTime}; |
| use tracing::{debug, trace}; |
| |
| /// The filename of the database. |
| const GLOBAL_CACHE_FILENAME: &str = ".global-cache"; |
| |
| const REGISTRY_INDEX_TABLE: &str = "registry_index"; |
| const REGISTRY_CRATE_TABLE: &str = "registry_crate"; |
| const REGISTRY_SRC_TABLE: &str = "registry_src"; |
| const GIT_DB_TABLE: &str = "git_db"; |
| const GIT_CO_TABLE: &str = "git_checkout"; |
| |
| /// How often timestamps will be updated. |
| /// |
| /// As an optimization timestamps are not updated unless they are older than |
| /// the given number of seconds. This helps reduce the amount of disk I/O when |
| /// running cargo multiple times within a short window. |
| const UPDATE_RESOLUTION: u64 = 60 * 5; |
| |
| /// Type for timestamps as stored in the database. |
| /// |
| /// These are seconds since the Unix epoch. |
| type Timestamp = u64; |
| |
| /// The key for a registry index entry stored in the database. |
| #[derive(Clone, Debug, Hash, Eq, PartialEq)] |
| pub struct RegistryIndex { |
| /// A unique name of the registry source. |
| pub encoded_registry_name: InternedString, |
| } |
| |
| /// The key for a registry `.crate` entry stored in the database. |
| #[derive(Clone, Debug, Hash, Eq, PartialEq)] |
| pub struct RegistryCrate { |
| /// A unique name of the registry source. |
| pub encoded_registry_name: InternedString, |
| /// The filename of the compressed crate, like `foo-1.2.3.crate`. |
| pub crate_filename: InternedString, |
| /// The size of the `.crate` file. |
| pub size: u64, |
| } |
| |
| /// The key for a registry src directory entry stored in the database. |
| #[derive(Clone, Debug, Hash, Eq, PartialEq)] |
| pub struct RegistrySrc { |
| /// A unique name of the registry source. |
| pub encoded_registry_name: InternedString, |
| /// The directory name of the extracted source, like `foo-1.2.3`. |
| pub package_dir: InternedString, |
| /// Total size of the src directory in bytes. |
| /// |
| /// This can be None when the size is unknown. For example, when the src |
| /// directory already exists on disk, and we just want to update the |
| /// last-use timestamp. We don't want to take the expense of computing disk |
| /// usage unless necessary. [`GlobalCacheTracker::populate_untracked`] |
| /// will handle any actual NULL values in the database, which can happen |
| /// when the src directory is created by an older version of cargo that |
| /// did not track sizes. |
| pub size: Option<u64>, |
| } |
| |
| /// The key for a git db entry stored in the database. |
| #[derive(Clone, Debug, Hash, Eq, PartialEq)] |
| pub struct GitDb { |
| /// A unique name of the git database. |
| pub encoded_git_name: InternedString, |
| } |
| |
| /// The key for a git checkout entry stored in the database. |
| #[derive(Clone, Debug, Hash, Eq, PartialEq)] |
| pub struct GitCheckout { |
| /// A unique name of the git database. |
| pub encoded_git_name: InternedString, |
| /// A unique name of the checkout without the database. |
| pub short_name: InternedString, |
| /// Total size of the checkout directory. |
| /// |
| /// This can be None when the size is unknown. See [`RegistrySrc::size`] |
| /// for an explanation. |
| pub size: Option<u64>, |
| } |
| |
| /// Filesystem paths in the global cache. |
| /// |
| /// Accessing these assumes a lock has already been acquired. |
| struct BasePaths { |
| /// Root path to the index caches. |
| index: PathBuf, |
| /// Root path to the git DBs. |
| git_db: PathBuf, |
| /// Root path to the git checkouts. |
| git_co: PathBuf, |
| /// Root path to the `.crate` files. |
| crate_dir: PathBuf, |
| /// Root path to the `src` directories. |
| src: PathBuf, |
| } |
| |
| /// Migrations which initialize the database, and can be used to evolve it over time. |
| /// |
| /// See [`Migration`] for more detail. |
| /// |
| /// **Be sure to not change the order or entries here!** |
| fn migrations() -> Vec<Migration> { |
| vec![ |
| // registry_index tracks the overall usage of an index cache, and tracks a |
| // numeric ID to refer to that index that is used in other tables. |
| basic_migration( |
| "CREATE TABLE registry_index ( |
| id INTEGER PRIMARY KEY AUTOINCREMENT, |
| name TEXT UNIQUE NOT NULL, |
| timestamp INTEGER NOT NULL |
| )", |
| ), |
| // .crate files |
| basic_migration( |
| "CREATE TABLE registry_crate ( |
| registry_id INTEGER NOT NULL, |
| name TEXT NOT NULL, |
| size INTEGER NOT NULL, |
| timestamp INTEGER NOT NULL, |
| PRIMARY KEY (registry_id, name), |
| FOREIGN KEY (registry_id) REFERENCES registry_index (id) ON DELETE CASCADE |
| )", |
| ), |
| // Extracted src directories |
| // |
| // Note that `size` can be NULL. This will happen when marking a src |
| // directory as used that was created by an older version of cargo |
| // that didn't do size tracking. |
| basic_migration( |
| "CREATE TABLE registry_src ( |
| registry_id INTEGER NOT NULL, |
| name TEXT NOT NULL, |
| size INTEGER, |
| timestamp INTEGER NOT NULL, |
| PRIMARY KEY (registry_id, name), |
| FOREIGN KEY (registry_id) REFERENCES registry_index (id) ON DELETE CASCADE |
| )", |
| ), |
| // Git db directories |
| basic_migration( |
| "CREATE TABLE git_db ( |
| id INTEGER PRIMARY KEY AUTOINCREMENT, |
| name TEXT UNIQUE NOT NULL, |
| timestamp INTEGER NOT NULL |
| )", |
| ), |
| // Git checkout directories |
| basic_migration( |
| "CREATE TABLE git_checkout ( |
| git_id INTEGER NOT NULL, |
| name TEXT UNIQUE NOT NULL, |
| size INTEGER, |
| timestamp INTEGER NOT NULL, |
| PRIMARY KEY (git_id, name), |
| FOREIGN KEY (git_id) REFERENCES git_db (id) ON DELETE CASCADE |
| )", |
| ), |
| // This is a general-purpose single-row table that can store arbitrary |
| // data. Feel free to add columns (with ALTER TABLE) if necessary. |
| basic_migration( |
| "CREATE TABLE global_data ( |
| last_auto_gc INTEGER NOT NULL |
| )", |
| ), |
| // last_auto_gc tracks the last time auto-gc was run (so that it only |
| // runs roughly once a day for performance reasons). Prime it with the |
| // current time to establish a baseline. |
| Box::new(|conn| { |
| conn.execute( |
| "INSERT INTO global_data (last_auto_gc) VALUES (?1)", |
| [now()], |
| )?; |
| Ok(()) |
| }), |
| ] |
| } |
| |
| /// Type for SQL columns that refer to the primary key of their parent table. |
| /// |
| /// For example, `registry_crate.registry_id` refers to its parent `registry_index.id`. |
| #[derive(Copy, Clone, Debug, PartialEq)] |
| struct ParentId(i64); |
| |
| impl rusqlite::types::FromSql for ParentId { |
| fn column_result(value: rusqlite::types::ValueRef<'_>) -> rusqlite::types::FromSqlResult<Self> { |
| let i = i64::column_result(value)?; |
| Ok(ParentId(i)) |
| } |
| } |
| |
| impl rusqlite::types::ToSql for ParentId { |
| fn to_sql(&self) -> rusqlite::Result<rusqlite::types::ToSqlOutput<'_>> { |
| Ok(rusqlite::types::ToSqlOutput::from(self.0)) |
| } |
| } |
| |
| /// Tracking for the global shared cache (registry files, etc.). |
| /// |
| /// This is the interface to the global cache database, used for tracking and |
| /// cleaning. See the [`crate::core::global_cache_tracker`] module docs for |
| /// details. |
| #[derive(Debug)] |
| pub struct GlobalCacheTracker { |
| /// Connection to the SQLite database. |
| conn: Connection, |
| /// This is an optimization used to make sure cargo only checks if gc |
| /// needs to run once per session. This starts as `false`, and then the |
| /// first time it checks if automatic gc needs to run, it will be set to |
| /// `true`. |
| auto_gc_checked_this_session: bool, |
| } |
| |
| impl GlobalCacheTracker { |
| /// Creates a new [`GlobalCacheTracker`]. |
| /// |
| /// The caller is responsible for locking the package cache with |
| /// [`CacheLockMode::DownloadExclusive`] before calling this. |
| pub fn new(gctx: &GlobalContext) -> CargoResult<GlobalCacheTracker> { |
| let db_path = Self::db_path(gctx); |
| // A package cache lock is required to ensure only one cargo is |
| // accessing at the same time. If there is concurrent access, we |
| // want to rely on cargo's own "Blocking" system (which can |
| // provide user feedback) rather than blocking inside sqlite |
| // (which by default has a short timeout). |
| let db_path = gctx.assert_package_cache_locked(CacheLockMode::DownloadExclusive, &db_path); |
| let mut conn = Connection::open(db_path)?; |
| conn.pragma_update(None, "foreign_keys", true)?; |
| sqlite::migrate(&mut conn, &migrations())?; |
| Ok(GlobalCacheTracker { |
| conn, |
| auto_gc_checked_this_session: false, |
| }) |
| } |
| |
| /// The path to the database. |
| pub fn db_path(gctx: &GlobalContext) -> Filesystem { |
| gctx.home().join(GLOBAL_CACHE_FILENAME) |
| } |
| |
| /// Given an encoded registry name, returns its ID. |
| /// |
| /// Returns None if the given name isn't in the database. |
| fn id_from_name( |
| conn: &Connection, |
| table_name: &str, |
| encoded_name: &str, |
| ) -> CargoResult<Option<ParentId>> { |
| let mut stmt = |
| conn.prepare_cached(&format!("SELECT id FROM {table_name} WHERE name = ?"))?; |
| match stmt.query_row([encoded_name], |row| row.get(0)) { |
| Ok(id) => Ok(Some(id)), |
| Err(rusqlite::Error::QueryReturnedNoRows) => Ok(None), |
| Err(e) => Err(e.into()), |
| } |
| } |
| |
| /// Returns a map of ID to path for the given ids in the given table. |
| /// |
| /// For example, given `registry_index` IDs, it returns filenames of the |
| /// form "index.crates.io-6f17d22bba15001f". |
| fn get_id_map( |
| conn: &Connection, |
| table_name: &str, |
| ids: &[i64], |
| ) -> CargoResult<HashMap<i64, PathBuf>> { |
| let mut stmt = |
| conn.prepare_cached(&format!("SELECT name FROM {table_name} WHERE id = ?1"))?; |
| ids.iter() |
| .map(|id| { |
| let name = stmt.query_row(params![id], |row| { |
| Ok(PathBuf::from(row.get::<_, String>(0)?)) |
| })?; |
| Ok((*id, name)) |
| }) |
| .collect() |
| } |
| |
| /// Returns all index cache timestamps. |
| pub fn registry_index_all(&self) -> CargoResult<Vec<(RegistryIndex, Timestamp)>> { |
| let mut stmt = self |
| .conn |
| .prepare_cached("SELECT name, timestamp FROM registry_index")?; |
| let rows = stmt |
| .query_map([], |row| { |
| let encoded_registry_name = row.get_unwrap(0); |
| let timestamp = row.get_unwrap(1); |
| let kind = RegistryIndex { |
| encoded_registry_name, |
| }; |
| Ok((kind, timestamp)) |
| })? |
| .collect::<Result<Vec<_>, _>>()?; |
| Ok(rows) |
| } |
| |
| /// Returns all registry crate cache timestamps. |
| pub fn registry_crate_all(&self) -> CargoResult<Vec<(RegistryCrate, Timestamp)>> { |
| let mut stmt = self.conn.prepare_cached( |
| "SELECT registry_index.name, registry_crate.name, registry_crate.size, registry_crate.timestamp |
| FROM registry_index, registry_crate |
| WHERE registry_crate.registry_id = registry_index.id", |
| )?; |
| let rows = stmt |
| .query_map([], |row| { |
| let encoded_registry_name = row.get_unwrap(0); |
| let crate_filename = row.get_unwrap(1); |
| let size = row.get_unwrap(2); |
| let timestamp = row.get_unwrap(3); |
| let kind = RegistryCrate { |
| encoded_registry_name, |
| crate_filename, |
| size, |
| }; |
| Ok((kind, timestamp)) |
| })? |
| .collect::<Result<Vec<_>, _>>()?; |
| Ok(rows) |
| } |
| |
| /// Returns all registry source cache timestamps. |
| pub fn registry_src_all(&self) -> CargoResult<Vec<(RegistrySrc, Timestamp)>> { |
| let mut stmt = self.conn.prepare_cached( |
| "SELECT registry_index.name, registry_src.name, registry_src.size, registry_src.timestamp |
| FROM registry_index, registry_src |
| WHERE registry_src.registry_id = registry_index.id", |
| )?; |
| let rows = stmt |
| .query_map([], |row| { |
| let encoded_registry_name = row.get_unwrap(0); |
| let package_dir = row.get_unwrap(1); |
| let size = row.get_unwrap(2); |
| let timestamp = row.get_unwrap(3); |
| let kind = RegistrySrc { |
| encoded_registry_name, |
| package_dir, |
| size, |
| }; |
| Ok((kind, timestamp)) |
| })? |
| .collect::<Result<Vec<_>, _>>()?; |
| Ok(rows) |
| } |
| |
| /// Returns all git db timestamps. |
| pub fn git_db_all(&self) -> CargoResult<Vec<(GitDb, Timestamp)>> { |
| let mut stmt = self |
| .conn |
| .prepare_cached("SELECT name, timestamp FROM git_db")?; |
| let rows = stmt |
| .query_map([], |row| { |
| let encoded_git_name = row.get_unwrap(0); |
| let timestamp = row.get_unwrap(1); |
| let kind = GitDb { encoded_git_name }; |
| Ok((kind, timestamp)) |
| })? |
| .collect::<Result<Vec<_>, _>>()?; |
| Ok(rows) |
| } |
| |
| /// Returns all git checkout timestamps. |
| pub fn git_checkout_all(&self) -> CargoResult<Vec<(GitCheckout, Timestamp)>> { |
| let mut stmt = self.conn.prepare_cached( |
| "SELECT git_db.name, git_checkout.name, git_checkout.size, git_checkout.timestamp |
| FROM git_db, git_checkout |
| WHERE git_checkout.git_id = git_db.id", |
| )?; |
| let rows = stmt |
| .query_map([], |row| { |
| let encoded_git_name = row.get_unwrap(0); |
| let short_name = row.get_unwrap(1); |
| let size = row.get_unwrap(2); |
| let timestamp = row.get_unwrap(3); |
| let kind = GitCheckout { |
| encoded_git_name, |
| short_name, |
| size, |
| }; |
| Ok((kind, timestamp)) |
| })? |
| .collect::<Result<Vec<_>, _>>()?; |
| Ok(rows) |
| } |
| |
| /// Returns whether or not an auto GC should be performed, compared to the |
| /// last time it was recorded in the database. |
| pub fn should_run_auto_gc(&mut self, frequency: Duration) -> CargoResult<bool> { |
| trace!(target: "gc", "should_run_auto_gc"); |
| if self.auto_gc_checked_this_session { |
| return Ok(false); |
| } |
| let last_auto_gc: Timestamp = |
| self.conn |
| .query_row("SELECT last_auto_gc FROM global_data", [], |row| row.get(0))?; |
| let should_run = last_auto_gc + frequency.as_secs() < now(); |
| trace!(target: "gc", |
| "last auto gc was {}, {}", |
| last_auto_gc, |
| if should_run { "running" } else { "skipping" } |
| ); |
| self.auto_gc_checked_this_session = true; |
| Ok(should_run) |
| } |
| |
| /// Writes to the database to indicate that an automatic GC has just been |
| /// completed. |
| pub fn set_last_auto_gc(&self) -> CargoResult<()> { |
| self.conn |
| .execute("UPDATE global_data SET last_auto_gc = ?1", [now()])?; |
| Ok(()) |
| } |
| |
| /// Deletes files from the global cache based on the given options. |
| pub fn clean(&mut self, clean_ctx: &mut CleanContext<'_>, gc_opts: &GcOpts) -> CargoResult<()> { |
| self.clean_inner(clean_ctx, gc_opts) |
| .with_context(|| "failed to clean entries from the global cache") |
| } |
| |
| #[tracing::instrument(skip_all)] |
| fn clean_inner( |
| &mut self, |
| clean_ctx: &mut CleanContext<'_>, |
| gc_opts: &GcOpts, |
| ) -> CargoResult<()> { |
| let gctx = clean_ctx.gctx; |
| let base = BasePaths { |
| index: gctx.registry_index_path().into_path_unlocked(), |
| git_db: gctx.git_db_path().into_path_unlocked(), |
| git_co: gctx.git_checkouts_path().into_path_unlocked(), |
| crate_dir: gctx.registry_cache_path().into_path_unlocked(), |
| src: gctx.registry_source_path().into_path_unlocked(), |
| }; |
| let now = now(); |
| trace!(target: "gc", "cleaning {gc_opts:?}"); |
| let tx = self.conn.transaction()?; |
| let mut delete_paths = Vec::new(); |
| // This can be an expensive operation, so only perform it if necessary. |
| if gc_opts.is_download_cache_opt_set() { |
| // TODO: Investigate how slow this might be. |
| Self::sync_db_with_files( |
| &tx, |
| now, |
| gctx, |
| &base, |
| gc_opts.is_download_cache_size_set(), |
| &mut delete_paths, |
| ) |
| .with_context(|| "failed to sync tracking database")? |
| } |
| if let Some(max_age) = gc_opts.max_index_age { |
| let max_age = now - max_age.as_secs(); |
| Self::get_registry_index_to_clean(&tx, max_age, &base, &mut delete_paths)?; |
| } |
| if let Some(max_age) = gc_opts.max_src_age { |
| let max_age = now - max_age.as_secs(); |
| Self::get_registry_items_to_clean_age( |
| &tx, |
| max_age, |
| REGISTRY_SRC_TABLE, |
| &base.src, |
| &mut delete_paths, |
| )?; |
| } |
| if let Some(max_age) = gc_opts.max_crate_age { |
| let max_age = now - max_age.as_secs(); |
| Self::get_registry_items_to_clean_age( |
| &tx, |
| max_age, |
| REGISTRY_CRATE_TABLE, |
| &base.crate_dir, |
| &mut delete_paths, |
| )?; |
| } |
| if let Some(max_age) = gc_opts.max_git_db_age { |
| let max_age = now - max_age.as_secs(); |
| Self::get_git_db_items_to_clean(&tx, max_age, &base, &mut delete_paths)?; |
| } |
| if let Some(max_age) = gc_opts.max_git_co_age { |
| let max_age = now - max_age.as_secs(); |
| Self::get_git_co_items_to_clean(&tx, max_age, &base.git_co, &mut delete_paths)?; |
| } |
| // Size collection must happen after date collection so that dates |
| // have precedence, since size constraints are a more blunt |
| // instrument. |
| // |
| // These are also complicated by the `--max-download-size` option |
| // overlapping with `--max-crate-size` and `--max-src-size`, which |
| // requires some coordination between those options which isn't |
| // necessary with the age-based options. An item's age is either older |
| // or it isn't, but contrast that with size which is based on the sum |
| // of all tracked items. Also, `--max-download-size` is summed against |
| // both the crate and src tracking, which requires combining them to |
| // compute the size, and then separating them to calculate the correct |
| // paths. |
| if let Some(max_size) = gc_opts.max_crate_size { |
| Self::get_registry_items_to_clean_size( |
| &tx, |
| max_size, |
| REGISTRY_CRATE_TABLE, |
| &base.crate_dir, |
| &mut delete_paths, |
| )?; |
| } |
| if let Some(max_size) = gc_opts.max_src_size { |
| Self::get_registry_items_to_clean_size( |
| &tx, |
| max_size, |
| REGISTRY_SRC_TABLE, |
| &base.src, |
| &mut delete_paths, |
| )?; |
| } |
| if let Some(max_size) = gc_opts.max_git_size { |
| Self::get_git_items_to_clean_size(&tx, max_size, &base, &mut delete_paths)?; |
| } |
| if let Some(max_size) = gc_opts.max_download_size { |
| Self::get_registry_items_to_clean_size_both(&tx, max_size, &base, &mut delete_paths)?; |
| } |
| |
| clean_ctx.remove_paths(&delete_paths)?; |
| |
| if clean_ctx.dry_run { |
| tx.rollback()?; |
| } else { |
| tx.commit()?; |
| } |
| Ok(()) |
| } |
| |
| /// Returns a list of directory entries in the given path. |
| fn names_from(path: &Path) -> CargoResult<Vec<String>> { |
| let entries = match path.read_dir() { |
| Ok(e) => e, |
| Err(e) => { |
| if e.kind() == std::io::ErrorKind::NotFound { |
| return Ok(Vec::new()); |
| } else { |
| return Err( |
| anyhow::Error::new(e).context(format!("failed to read path `{path:?}`")) |
| ); |
| } |
| } |
| }; |
| let names = entries |
| .filter_map(|entry| entry.ok()?.file_name().into_string().ok()) |
| .collect(); |
| Ok(names) |
| } |
| |
| /// Synchronizes the database to match the files on disk. |
| /// |
| /// This performs the following cleanups: |
| /// |
| /// 1. Remove entries from the database that are missing on disk. |
| /// 2. Adds missing entries to the database that are on disk (such as when |
| /// files are added by older versions of cargo). |
| /// 3. Fills in the `size` column where it is NULL (such as when something |
| /// is added to disk by an older version of cargo, and one of the mark |
| /// functions marked it without knowing the size). |
| /// |
| /// Size computations are only done if `sync_size` is set since it can |
| /// be a very expensive operation. This should only be set if the user |
| /// requested to clean based on the cache size. |
| /// 4. Checks for orphaned files. For example, if there are `.crate` files |
| /// associated with an index that does not exist. |
| /// |
| /// These orphaned files will be added to `delete_paths` so that the |
| /// caller can delete them. |
| #[tracing::instrument(skip(conn, gctx, base, delete_paths))] |
| fn sync_db_with_files( |
| conn: &Connection, |
| now: Timestamp, |
| gctx: &GlobalContext, |
| base: &BasePaths, |
| sync_size: bool, |
| delete_paths: &mut Vec<PathBuf>, |
| ) -> CargoResult<()> { |
| debug!(target: "gc", "starting db sync"); |
| // For registry_index and git_db, add anything that is missing in the db. |
| Self::update_parent_for_missing_from_db(conn, now, REGISTRY_INDEX_TABLE, &base.index)?; |
| Self::update_parent_for_missing_from_db(conn, now, GIT_DB_TABLE, &base.git_db)?; |
| |
| // For registry_crate, registry_src, and git_checkout, remove anything |
| // from the db that isn't on disk. |
| Self::update_db_for_removed( |
| conn, |
| REGISTRY_INDEX_TABLE, |
| "registry_id", |
| REGISTRY_CRATE_TABLE, |
| &base.crate_dir, |
| )?; |
| Self::update_db_for_removed( |
| conn, |
| REGISTRY_INDEX_TABLE, |
| "registry_id", |
| REGISTRY_SRC_TABLE, |
| &base.src, |
| )?; |
| Self::update_db_for_removed(conn, GIT_DB_TABLE, "git_id", GIT_CO_TABLE, &base.git_co)?; |
| |
| // For registry_index and git_db, remove anything from the db that |
| // isn't on disk. |
| // |
| // This also collects paths for any child files that don't have their |
| // respective parent on disk. |
| Self::update_db_parent_for_removed_from_disk( |
| conn, |
| REGISTRY_INDEX_TABLE, |
| &base.index, |
| &[&base.crate_dir, &base.src], |
| delete_paths, |
| )?; |
| Self::update_db_parent_for_removed_from_disk( |
| conn, |
| GIT_DB_TABLE, |
| &base.git_db, |
| &[&base.git_co], |
| delete_paths, |
| )?; |
| |
| // For registry_crate, registry_src, and git_checkout, add anything |
| // that is missing in the db. |
| Self::populate_untracked_crate(conn, now, &base.crate_dir)?; |
| Self::populate_untracked( |
| conn, |
| now, |
| gctx, |
| REGISTRY_INDEX_TABLE, |
| "registry_id", |
| REGISTRY_SRC_TABLE, |
| &base.src, |
| sync_size, |
| )?; |
| Self::populate_untracked( |
| conn, |
| now, |
| gctx, |
| GIT_DB_TABLE, |
| "git_id", |
| GIT_CO_TABLE, |
| &base.git_co, |
| sync_size, |
| )?; |
| |
| // Update any NULL sizes if needed. |
| if sync_size { |
| Self::update_null_sizes( |
| conn, |
| gctx, |
| REGISTRY_INDEX_TABLE, |
| "registry_id", |
| REGISTRY_SRC_TABLE, |
| &base.src, |
| )?; |
| Self::update_null_sizes( |
| conn, |
| gctx, |
| GIT_DB_TABLE, |
| "git_id", |
| GIT_CO_TABLE, |
| &base.git_co, |
| )?; |
| } |
| Ok(()) |
| } |
| |
| /// For parent tables, add any entries that are on disk but aren't tracked in the db. |
| #[tracing::instrument(skip(conn, now, base_path))] |
| fn update_parent_for_missing_from_db( |
| conn: &Connection, |
| now: Timestamp, |
| parent_table_name: &str, |
| base_path: &Path, |
| ) -> CargoResult<()> { |
| trace!(target: "gc", "checking for untracked parent to add to {parent_table_name}"); |
| let names = Self::names_from(base_path)?; |
| |
| let mut stmt = conn.prepare_cached(&format!( |
| "INSERT INTO {parent_table_name} (name, timestamp) |
| VALUES (?1, ?2) |
| ON CONFLICT DO NOTHING", |
| ))?; |
| for name in names { |
| stmt.execute(params![name, now])?; |
| } |
| Ok(()) |
| } |
| |
| /// Removes database entries for any files that are not on disk for the child tables. |
| /// |
| /// This could happen for example if the user manually deleted the file or |
| /// any such scenario where the filesystem and db are out of sync. |
| #[tracing::instrument(skip(conn, base_path))] |
| fn update_db_for_removed( |
| conn: &Connection, |
| parent_table_name: &str, |
| id_column_name: &str, |
| table_name: &str, |
| base_path: &Path, |
| ) -> CargoResult<()> { |
| trace!(target: "gc", "checking for db entries to remove from {table_name}"); |
| let mut select_stmt = conn.prepare_cached(&format!( |
| "SELECT {table_name}.rowid, {parent_table_name}.name, {table_name}.name |
| FROM {parent_table_name}, {table_name} |
| WHERE {table_name}.{id_column_name} = {parent_table_name}.id", |
| ))?; |
| let mut delete_stmt = |
| conn.prepare_cached(&format!("DELETE FROM {table_name} WHERE rowid = ?1"))?; |
| let mut rows = select_stmt.query([])?; |
| while let Some(row) = rows.next()? { |
| let rowid: i64 = row.get_unwrap(0); |
| let id_name: String = row.get_unwrap(1); |
| let name: String = row.get_unwrap(2); |
| if !base_path.join(id_name).join(name).exists() { |
| delete_stmt.execute([rowid])?; |
| } |
| } |
| Ok(()) |
| } |
| |
| /// Removes database entries for any files that are not on disk for the parent tables. |
| #[tracing::instrument(skip(conn, base_path, child_base_paths, delete_paths))] |
| fn update_db_parent_for_removed_from_disk( |
| conn: &Connection, |
| parent_table_name: &str, |
| base_path: &Path, |
| child_base_paths: &[&Path], |
| delete_paths: &mut Vec<PathBuf>, |
| ) -> CargoResult<()> { |
| trace!(target: "gc", "checking for db entries to remove from {parent_table_name}"); |
| let mut select_stmt = |
| conn.prepare_cached(&format!("SELECT rowid, name FROM {parent_table_name}"))?; |
| let mut delete_stmt = |
| conn.prepare_cached(&format!("DELETE FROM {parent_table_name} WHERE rowid = ?1"))?; |
| let mut rows = select_stmt.query([])?; |
| while let Some(row) = rows.next()? { |
| let rowid: i64 = row.get_unwrap(0); |
| let id_name: String = row.get_unwrap(1); |
| if !base_path.join(&id_name).exists() { |
| delete_stmt.execute([rowid])?; |
| // Make sure any child data is also cleaned up. |
| for child_base in child_base_paths { |
| let child_path = child_base.join(&id_name); |
| if child_path.exists() { |
| debug!(target: "gc", "removing orphaned path {child_path:?}"); |
| delete_paths.push(child_path); |
| } |
| } |
| } |
| } |
| Ok(()) |
| } |
| |
| /// Updates the database to add any `.crate` files that are currently |
| /// not tracked (such as when they are downloaded by an older version of |
| /// cargo). |
| #[tracing::instrument(skip(conn, now, base_path))] |
| fn populate_untracked_crate( |
| conn: &Connection, |
| now: Timestamp, |
| base_path: &Path, |
| ) -> CargoResult<()> { |
| trace!(target: "gc", "populating untracked crate files"); |
| let mut insert_stmt = conn.prepare_cached( |
| "INSERT INTO registry_crate (registry_id, name, size, timestamp) |
| VALUES (?1, ?2, ?3, ?4) |
| ON CONFLICT DO NOTHING", |
| )?; |
| let index_names = Self::names_from(&base_path)?; |
| for index_name in index_names { |
| let Some(id) = Self::id_from_name(conn, REGISTRY_INDEX_TABLE, &index_name)? else { |
| // The id is missing from the database. This should be resolved |
| // via update_db_parent_for_removed_from_disk. |
| continue; |
| }; |
| let index_path = base_path.join(index_name); |
| for crate_name in Self::names_from(&index_path)? { |
| if crate_name.ends_with(".crate") { |
| // Missing files should have already been taken care of by |
| // update_db_for_removed. |
| let size = paths::metadata(index_path.join(&crate_name))?.len(); |
| insert_stmt.execute(params![id, crate_name, size, now])?; |
| } |
| } |
| } |
| Ok(()) |
| } |
| |
| /// Updates the database to add any files that are currently not tracked |
| /// (such as when they are downloaded by an older version of cargo). |
| #[tracing::instrument(skip(conn, now, gctx, base_path, populate_size))] |
| fn populate_untracked( |
| conn: &Connection, |
| now: Timestamp, |
| gctx: &GlobalContext, |
| id_table_name: &str, |
| id_column_name: &str, |
| table_name: &str, |
| base_path: &Path, |
| populate_size: bool, |
| ) -> CargoResult<()> { |
| trace!(target: "gc", "populating untracked files for {table_name}"); |
| // Gather names (and make sure they are in the database). |
| let id_names = Self::names_from(&base_path)?; |
| |
| // This SELECT is used to determine if the directory is already |
| // tracked. We don't want to do the expensive size computation unless |
| // necessary. |
| let mut select_stmt = conn.prepare_cached(&format!( |
| "SELECT 1 FROM {table_name} |
| WHERE {id_column_name} = ?1 AND name = ?2", |
| ))?; |
| let mut insert_stmt = conn.prepare_cached(&format!( |
| "INSERT INTO {table_name} ({id_column_name}, name, size, timestamp) |
| VALUES (?1, ?2, ?3, ?4) |
| ON CONFLICT DO NOTHING", |
| ))?; |
| let mut progress = Progress::with_style("Scanning", ProgressStyle::Ratio, gctx); |
| // Compute the size of any directory not in the database. |
| for id_name in id_names { |
| let Some(id) = Self::id_from_name(conn, id_table_name, &id_name)? else { |
| // The id is missing from the database. This should be resolved |
| // via update_db_parent_for_removed_from_disk. |
| continue; |
| }; |
| let index_path = base_path.join(id_name); |
| let names = Self::names_from(&index_path)?; |
| let max = names.len(); |
| for (i, name) in names.iter().enumerate() { |
| if select_stmt.exists(params![id, name])? { |
| continue; |
| } |
| let dir_path = index_path.join(name); |
| if !dir_path.is_dir() { |
| continue; |
| } |
| progress.tick(i, max, "")?; |
| let size = if populate_size { |
| Some(du(&dir_path, table_name)?) |
| } else { |
| None |
| }; |
| insert_stmt.execute(params![id, name, size, now])?; |
| } |
| } |
| Ok(()) |
| } |
| |
| /// Fills in the `size` column where it is NULL. |
| /// |
| /// This can happen when something is added to disk by an older version of |
| /// cargo, and one of the mark functions marked it without knowing the |
| /// size. |
| /// |
| /// `update_db_for_removed` should be called before this is called. |
| #[tracing::instrument(skip(conn, gctx, base_path))] |
| fn update_null_sizes( |
| conn: &Connection, |
| gctx: &GlobalContext, |
| parent_table_name: &str, |
| id_column_name: &str, |
| table_name: &str, |
| base_path: &Path, |
| ) -> CargoResult<()> { |
| trace!(target: "gc", "updating NULL size information in {table_name}"); |
| let mut null_stmt = conn.prepare_cached(&format!( |
| "SELECT {table_name}.rowid, {table_name}.name, {parent_table_name}.name |
| FROM {table_name}, {parent_table_name} |
| WHERE {table_name}.size IS NULL AND {table_name}.{id_column_name} = {parent_table_name}.id", |
| ))?; |
| let mut update_stmt = conn.prepare_cached(&format!( |
| "UPDATE {table_name} SET size = ?1 WHERE rowid = ?2" |
| ))?; |
| let mut progress = Progress::with_style("Scanning", ProgressStyle::Ratio, gctx); |
| let rows: Vec<_> = null_stmt |
| .query_map([], |row| { |
| Ok((row.get_unwrap(0), row.get_unwrap(1), row.get_unwrap(2))) |
| })? |
| .collect(); |
| let max = rows.len(); |
| for (i, row) in rows.into_iter().enumerate() { |
| let (rowid, name, id_name): (i64, String, String) = row?; |
| let path = base_path.join(id_name).join(name); |
| progress.tick(i, max, "")?; |
| // Missing files should have already been taken care of by |
| // update_db_for_removed. |
| let size = du(&path, table_name)?; |
| update_stmt.execute(params![size, rowid])?; |
| } |
| Ok(()) |
| } |
| |
| /// Adds paths to delete from either registry_crate or registry_src whose |
| /// last use is older than the given timestamp. |
| fn get_registry_items_to_clean_age( |
| conn: &Connection, |
| max_age: Timestamp, |
| table_name: &str, |
| base_path: &Path, |
| delete_paths: &mut Vec<PathBuf>, |
| ) -> CargoResult<()> { |
| debug!(target: "gc", "cleaning {table_name} since {max_age:?}"); |
| let mut stmt = conn.prepare_cached(&format!( |
| "DELETE FROM {table_name} WHERE timestamp < ?1 |
| RETURNING registry_id, name" |
| ))?; |
| let rows = stmt |
| .query_map(params![max_age], |row| { |
| let registry_id = row.get_unwrap(0); |
| let name: String = row.get_unwrap(1); |
| Ok((registry_id, name)) |
| })? |
| .collect::<Result<Vec<_>, _>>()?; |
| let ids: Vec<_> = rows.iter().map(|r| r.0).collect(); |
| let id_map = Self::get_id_map(conn, REGISTRY_INDEX_TABLE, &ids)?; |
| for (id, name) in rows { |
| let encoded_registry_name = &id_map[&id]; |
| delete_paths.push(base_path.join(encoded_registry_name).join(name)); |
| } |
| Ok(()) |
| } |
| |
| /// Adds paths to delete from either `registry_crate` or `registry_src` in |
| /// order to keep the total size under the given max size. |
| fn get_registry_items_to_clean_size( |
| conn: &Connection, |
| max_size: u64, |
| table_name: &str, |
| base_path: &Path, |
| delete_paths: &mut Vec<PathBuf>, |
| ) -> CargoResult<()> { |
| debug!(target: "gc", "cleaning {table_name} till under {max_size:?}"); |
| let total_size: u64 = conn.query_row( |
| &format!("SELECT coalesce(SUM(size), 0) FROM {table_name}"), |
| [], |
| |row| row.get(0), |
| )?; |
| if total_size <= max_size { |
| return Ok(()); |
| } |
| // This SQL statement selects all of the rows ordered by timestamp, |
| // and then uses a window function to keep a running total of the |
| // size. It selects all rows until the running total exceeds the |
| // threshold of the total number of bytes that we want to delete. |
| // |
| // The window function essentially computes an aggregate over all |
| // previous rows as it goes along. As long as the running size is |
| // below the total amount that we need to delete, it keeps picking |
| // more rows. |
| // |
| // The ORDER BY includes `name` mainly for test purposes so that |
| // entries with the same timestamp have deterministic behavior. |
| // |
| // The coalesce helps convert NULL to 0. |
| let mut stmt = conn.prepare(&format!( |
| "DELETE FROM {table_name} WHERE rowid IN \ |
| (SELECT x.rowid FROM \ |
| (SELECT rowid, size, SUM(size) OVER \ |
| (ORDER BY timestamp, name ROWS UNBOUNDED PRECEDING) AS running_amount \ |
| FROM {table_name}) x \ |
| WHERE coalesce(x.running_amount, 0) - x.size < ?1) \ |
| RETURNING registry_id, name;" |
| ))?; |
| let rows = stmt |
| .query_map(params![total_size - max_size], |row| { |
| let id = row.get_unwrap(0); |
| let name: String = row.get_unwrap(1); |
| Ok((id, name)) |
| })? |
| .collect::<Result<Vec<_>, _>>()?; |
| // Convert registry_id to the encoded registry name, and join those. |
| let ids: Vec<_> = rows.iter().map(|r| r.0).collect(); |
| let id_map = Self::get_id_map(conn, REGISTRY_INDEX_TABLE, &ids)?; |
| for (id, name) in rows { |
| let encoded_name = &id_map[&id]; |
| delete_paths.push(base_path.join(encoded_name).join(name)); |
| } |
| Ok(()) |
| } |
| |
| /// Adds paths to delete from both `registry_crate` and `registry_src` in |
| /// order to keep the total size under the given max size. |
| fn get_registry_items_to_clean_size_both( |
| conn: &Connection, |
| max_size: u64, |
| base: &BasePaths, |
| delete_paths: &mut Vec<PathBuf>, |
| ) -> CargoResult<()> { |
| debug!(target: "gc", "cleaning download till under {max_size:?}"); |
| |
| // This SQL statement selects from both registry_src and |
| // registry_crate so that sorting of timestamps incorporates both of |
| // them at the same time. It uses a const value of 1 or 2 as the first |
| // column so that the code below can determine which table the value |
| // came from. |
| let mut stmt = conn.prepare_cached( |
| "SELECT 1, registry_src.rowid, registry_src.name AS name, registry_index.name, |
| registry_src.size, registry_src.timestamp AS timestamp |
| FROM registry_src, registry_index |
| WHERE registry_src.registry_id = registry_index.id AND registry_src.size NOT NULL |
| |
| UNION |
| |
| SELECT 2, registry_crate.rowid, registry_crate.name AS name, registry_index.name, |
| registry_crate.size, registry_crate.timestamp AS timestamp |
| FROM registry_crate, registry_index |
| WHERE registry_crate.registry_id = registry_index.id |
| |
| ORDER BY timestamp, name", |
| )?; |
| let mut delete_src_stmt = |
| conn.prepare_cached("DELETE FROM registry_src WHERE rowid = ?1")?; |
| let mut delete_crate_stmt = |
| conn.prepare_cached("DELETE FROM registry_crate WHERE rowid = ?1")?; |
| let rows = stmt |
| .query_map([], |row| { |
| Ok(( |
| row.get_unwrap(0), |
| row.get_unwrap(1), |
| row.get_unwrap(2), |
| row.get_unwrap(3), |
| row.get_unwrap(4), |
| )) |
| })? |
| .collect::<Result<Vec<(i64, i64, String, String, u64)>, _>>()?; |
| let mut total_size: u64 = rows.iter().map(|r| r.4).sum(); |
| debug!(target: "gc", "total download cache size appears to be {total_size}"); |
| for (table, rowid, name, index_name, size) in rows { |
| if total_size <= max_size { |
| break; |
| } |
| if table == 1 { |
| delete_paths.push(base.src.join(index_name).join(name)); |
| delete_src_stmt.execute([rowid])?; |
| } else { |
| delete_paths.push(base.crate_dir.join(index_name).join(name)); |
| delete_crate_stmt.execute([rowid])?; |
| } |
| // TODO: If delete crate, ensure src is also deleted. |
| total_size -= size; |
| } |
| Ok(()) |
| } |
| |
| /// Adds paths to delete from the git cache, keeping the total size under |
| /// the give value. |
| /// |
| /// Paths are relative to the `git` directory in the cache directory. |
| fn get_git_items_to_clean_size( |
| conn: &Connection, |
| max_size: u64, |
| base: &BasePaths, |
| delete_paths: &mut Vec<PathBuf>, |
| ) -> CargoResult<()> { |
| debug!(target: "gc", "cleaning git till under {max_size:?}"); |
| |
| // Collect all the sizes from git_db and git_checkouts, and then sort them by timestamp. |
| let mut stmt = conn.prepare_cached("SELECT rowid, name, timestamp FROM git_db")?; |
| let mut git_info = stmt |
| .query_map([], |row| { |
| let rowid: i64 = row.get_unwrap(0); |
| let name: String = row.get_unwrap(1); |
| let timestamp: Timestamp = row.get_unwrap(2); |
| // Size is added below so that the error doesn't need to be |
| // converted to a rusqlite error. |
| Ok((timestamp, rowid, None, name, 0)) |
| })? |
| .collect::<Result<Vec<_>, _>>()?; |
| for info in &mut git_info { |
| let size = cargo_util::du(&base.git_db.join(&info.3), &[])?; |
| info.4 = size; |
| } |
| |
| let mut stmt = conn.prepare_cached( |
| "SELECT git_checkout.rowid, git_db.name, git_checkout.name, |
| git_checkout.size, git_checkout.timestamp |
| FROM git_checkout, git_db |
| WHERE git_checkout.git_id = git_db.id AND git_checkout.size NOT NULL", |
| )?; |
| let git_co_rows = stmt |
| .query_map([], |row| { |
| let rowid = row.get_unwrap(0); |
| let db_name: String = row.get_unwrap(1); |
| let name = row.get_unwrap(2); |
| let size = row.get_unwrap(3); |
| let timestamp = row.get_unwrap(4); |
| Ok((timestamp, rowid, Some(db_name), name, size)) |
| })? |
| .collect::<Result<Vec<_>, _>>()?; |
| git_info.extend(git_co_rows); |
| |
| // Sort by timestamp, and name. The name is included mostly for test |
| // purposes so that entries with the same timestamp have deterministic |
| // behavior. |
| git_info.sort_by(|a, b| (b.0, &b.3).cmp(&(a.0, &a.3))); |
| |
| // Collect paths to delete. |
| let mut delete_db_stmt = conn.prepare_cached("DELETE FROM git_db WHERE rowid = ?1")?; |
| let mut delete_co_stmt = |
| conn.prepare_cached("DELETE FROM git_checkout WHERE rowid = ?1")?; |
| let mut total_size: u64 = git_info.iter().map(|r| r.4).sum(); |
| debug!(target: "gc", "total git cache size appears to be {total_size}"); |
| while let Some((_timestamp, rowid, db_name, name, size)) = git_info.pop() { |
| if total_size <= max_size { |
| break; |
| } |
| if let Some(db_name) = db_name { |
| delete_paths.push(base.git_co.join(db_name).join(name)); |
| delete_co_stmt.execute([rowid])?; |
| total_size -= size; |
| } else { |
| total_size -= size; |
| delete_paths.push(base.git_db.join(&name)); |
| delete_db_stmt.execute([rowid])?; |
| // If the db is deleted, then all the checkouts must be deleted. |
| let mut i = 0; |
| while i < git_info.len() { |
| if git_info[i].2.as_deref() == Some(name.as_ref()) { |
| let (_, rowid, db_name, name, size) = git_info.remove(i); |
| delete_paths.push(base.git_co.join(db_name.unwrap()).join(name)); |
| delete_co_stmt.execute([rowid])?; |
| total_size -= size; |
| } else { |
| i += 1; |
| } |
| } |
| } |
| } |
| Ok(()) |
| } |
| |
| /// Adds paths to delete from `registry_index` whose last use is older |
| /// than the given timestamp. |
| fn get_registry_index_to_clean( |
| conn: &Connection, |
| max_age: Timestamp, |
| base: &BasePaths, |
| delete_paths: &mut Vec<PathBuf>, |
| ) -> CargoResult<()> { |
| debug!(target: "gc", "cleaning index since {max_age:?}"); |
| let mut stmt = conn.prepare_cached( |
| "DELETE FROM registry_index WHERE timestamp < ?1 |
| RETURNING name", |
| )?; |
| let mut rows = stmt.query([max_age])?; |
| while let Some(row) = rows.next()? { |
| let name: String = row.get_unwrap(0); |
| delete_paths.push(base.index.join(&name)); |
| // Also delete .crate and src directories, since by definition |
| // they cannot be used without their index. |
| delete_paths.push(base.src.join(&name)); |
| delete_paths.push(base.crate_dir.join(&name)); |
| } |
| Ok(()) |
| } |
| |
| /// Adds paths to delete from `git_checkout` whose last use is |
| /// older than the given timestamp. |
| fn get_git_co_items_to_clean( |
| conn: &Connection, |
| max_age: Timestamp, |
| base_path: &Path, |
| delete_paths: &mut Vec<PathBuf>, |
| ) -> CargoResult<()> { |
| debug!(target: "gc", "cleaning git co since {max_age:?}"); |
| let mut stmt = conn.prepare_cached( |
| "DELETE FROM git_checkout WHERE timestamp < ?1 |
| RETURNING git_id, name", |
| )?; |
| let rows = stmt |
| .query_map(params![max_age], |row| { |
| let git_id = row.get_unwrap(0); |
| let name: String = row.get_unwrap(1); |
| Ok((git_id, name)) |
| })? |
| .collect::<Result<Vec<_>, _>>()?; |
| let ids: Vec<_> = rows.iter().map(|r| r.0).collect(); |
| let id_map = Self::get_id_map(conn, GIT_DB_TABLE, &ids)?; |
| for (id, name) in rows { |
| let encoded_git_name = &id_map[&id]; |
| delete_paths.push(base_path.join(encoded_git_name).join(name)); |
| } |
| Ok(()) |
| } |
| |
| /// Adds paths to delete from `git_db` in order to keep the total size |
| /// under the given max size. |
| fn get_git_db_items_to_clean( |
| conn: &Connection, |
| max_age: Timestamp, |
| base: &BasePaths, |
| delete_paths: &mut Vec<PathBuf>, |
| ) -> CargoResult<()> { |
| debug!(target: "gc", "cleaning git db since {max_age:?}"); |
| let mut stmt = conn.prepare_cached( |
| "DELETE FROM git_db WHERE timestamp < ?1 |
| RETURNING name", |
| )?; |
| let mut rows = stmt.query([max_age])?; |
| while let Some(row) = rows.next()? { |
| let name: String = row.get_unwrap(0); |
| delete_paths.push(base.git_db.join(&name)); |
| // Also delete checkout directories, since by definition they |
| // cannot be used without their db. |
| delete_paths.push(base.git_co.join(&name)); |
| } |
| Ok(()) |
| } |
| } |
| |
| /// Helper to generate the upsert for the parent tables. |
| /// |
| /// This handles checking if the row already exists, and only updates the |
| /// timestamp it if it hasn't been updated recently. This also handles keeping |
| /// a cached map of the `id` value. |
| /// |
| /// Unfortunately it is a bit tricky to share this code without a macro. |
| macro_rules! insert_or_update_parent { |
| ($self:expr, $conn:expr, $table_name:expr, $timestamps_field:ident, $keys_field:ident, $encoded_name:ident) => { |
| let mut select_stmt = $conn.prepare_cached(concat!( |
| "SELECT id, timestamp FROM ", |
| $table_name, |
| " WHERE name = ?1" |
| ))?; |
| let mut insert_stmt = $conn.prepare_cached(concat!( |
| "INSERT INTO ", |
| $table_name, |
| " (name, timestamp) |
| VALUES (?1, ?2) |
| ON CONFLICT DO UPDATE SET timestamp=excluded.timestamp |
| RETURNING id", |
| ))?; |
| let mut update_stmt = $conn.prepare_cached(concat!( |
| "UPDATE ", |
| $table_name, |
| " SET timestamp = ?1 WHERE id = ?2" |
| ))?; |
| for (parent, new_timestamp) in std::mem::take(&mut $self.$timestamps_field) { |
| trace!(target: "gc", |
| concat!("insert ", $table_name, " {:?} {}"), |
| parent, |
| new_timestamp |
| ); |
| let mut rows = select_stmt.query([parent.$encoded_name])?; |
| let id = if let Some(row) = rows.next()? { |
| let id: ParentId = row.get_unwrap(0); |
| let timestamp: Timestamp = row.get_unwrap(1); |
| if timestamp < new_timestamp - UPDATE_RESOLUTION { |
| update_stmt.execute(params![new_timestamp, id])?; |
| } |
| id |
| } else { |
| insert_stmt.query_row(params![parent.$encoded_name, new_timestamp], |row| { |
| row.get(0) |
| })? |
| }; |
| match $self.$keys_field.entry(parent.$encoded_name) { |
| hash_map::Entry::Occupied(o) => { |
| assert_eq!(*o.get(), id); |
| } |
| hash_map::Entry::Vacant(v) => { |
| v.insert(id); |
| } |
| } |
| } |
| return Ok(()); |
| }; |
| } |
| |
| /// This is a cache of modifications that will be saved to disk all at once |
| /// via the [`DeferredGlobalLastUse::save`] method. |
| /// |
| /// This is here to improve performance. |
| #[derive(Debug)] |
| pub struct DeferredGlobalLastUse { |
| /// Cache of registry keys, used for faster fetching. |
| /// |
| /// The key is the registry name (which is its directory name) and the |
| /// value is the `id` in the `registry_index` table. |
| registry_keys: HashMap<InternedString, ParentId>, |
| /// Cache of git keys, used for faster fetching. |
| /// |
| /// The key is the git db name (which is its directory name) and the value |
| /// is the `id` in the `git_db` table. |
| git_keys: HashMap<InternedString, ParentId>, |
| |
| /// New registry index entries to insert. |
| registry_index_timestamps: HashMap<RegistryIndex, Timestamp>, |
| /// New registry `.crate` entries to insert. |
| registry_crate_timestamps: HashMap<RegistryCrate, Timestamp>, |
| /// New registry src directory entries to insert. |
| registry_src_timestamps: HashMap<RegistrySrc, Timestamp>, |
| /// New git db entries to insert. |
| git_db_timestamps: HashMap<GitDb, Timestamp>, |
| /// New git checkout entries to insert. |
| git_checkout_timestamps: HashMap<GitCheckout, Timestamp>, |
| /// This is used so that a warning about failing to update the database is |
| /// only displayed once. |
| save_err_has_warned: bool, |
| /// The current time, used to improve performance to avoid accessing the |
| /// clock hundreds of times. |
| now: Timestamp, |
| } |
| |
| impl DeferredGlobalLastUse { |
| pub fn new() -> DeferredGlobalLastUse { |
| DeferredGlobalLastUse { |
| registry_keys: HashMap::new(), |
| git_keys: HashMap::new(), |
| registry_index_timestamps: HashMap::new(), |
| registry_crate_timestamps: HashMap::new(), |
| registry_src_timestamps: HashMap::new(), |
| git_db_timestamps: HashMap::new(), |
| git_checkout_timestamps: HashMap::new(), |
| save_err_has_warned: false, |
| now: now(), |
| } |
| } |
| |
| pub fn is_empty(&self) -> bool { |
| self.registry_index_timestamps.is_empty() |
| && self.registry_crate_timestamps.is_empty() |
| && self.registry_src_timestamps.is_empty() |
| && self.git_db_timestamps.is_empty() |
| && self.git_checkout_timestamps.is_empty() |
| } |
| |
| fn clear(&mut self) { |
| self.registry_index_timestamps.clear(); |
| self.registry_crate_timestamps.clear(); |
| self.registry_src_timestamps.clear(); |
| self.git_db_timestamps.clear(); |
| self.git_checkout_timestamps.clear(); |
| } |
| |
| /// Indicates the given [`RegistryIndex`] has been used right now. |
| pub fn mark_registry_index_used(&mut self, registry_index: RegistryIndex) { |
| self.mark_registry_index_used_stamp(registry_index, None); |
| } |
| |
| /// Indicates the given [`RegistryCrate`] has been used right now. |
| /// |
| /// Also implicitly marks the index used, too. |
| pub fn mark_registry_crate_used(&mut self, registry_crate: RegistryCrate) { |
| self.mark_registry_crate_used_stamp(registry_crate, None); |
| } |
| |
| /// Indicates the given [`RegistrySrc`] has been used right now. |
| /// |
| /// Also implicitly marks the index used, too. |
| pub fn mark_registry_src_used(&mut self, registry_src: RegistrySrc) { |
| self.mark_registry_src_used_stamp(registry_src, None); |
| } |
| |
| /// Indicates the given [`GitCheckout`] has been used right now. |
| /// |
| /// Also implicitly marks the git db used, too. |
| pub fn mark_git_checkout_used(&mut self, git_checkout: GitCheckout) { |
| self.mark_git_checkout_used_stamp(git_checkout, None); |
| } |
| |
| /// Indicates the given [`RegistryIndex`] has been used with the given |
| /// time (or "now" if `None`). |
| pub fn mark_registry_index_used_stamp( |
| &mut self, |
| registry_index: RegistryIndex, |
| timestamp: Option<&SystemTime>, |
| ) { |
| let timestamp = timestamp.map_or(self.now, to_timestamp); |
| self.registry_index_timestamps |
| .insert(registry_index, timestamp); |
| } |
| |
| /// Indicates the given [`RegistryCrate`] has been used with the given |
| /// time (or "now" if `None`). |
| /// |
| /// Also implicitly marks the index used, too. |
| pub fn mark_registry_crate_used_stamp( |
| &mut self, |
| registry_crate: RegistryCrate, |
| timestamp: Option<&SystemTime>, |
| ) { |
| let timestamp = timestamp.map_or(self.now, to_timestamp); |
| let index = RegistryIndex { |
| encoded_registry_name: registry_crate.encoded_registry_name, |
| }; |
| self.registry_index_timestamps.insert(index, timestamp); |
| self.registry_crate_timestamps |
| .insert(registry_crate, timestamp); |
| } |
| |
| /// Indicates the given [`RegistrySrc`] has been used with the given |
| /// time (or "now" if `None`). |
| /// |
| /// Also implicitly marks the index used, too. |
| pub fn mark_registry_src_used_stamp( |
| &mut self, |
| registry_src: RegistrySrc, |
| timestamp: Option<&SystemTime>, |
| ) { |
| let timestamp = timestamp.map_or(self.now, to_timestamp); |
| let index = RegistryIndex { |
| encoded_registry_name: registry_src.encoded_registry_name, |
| }; |
| self.registry_index_timestamps.insert(index, timestamp); |
| self.registry_src_timestamps.insert(registry_src, timestamp); |
| } |
| |
| /// Indicates the given [`GitCheckout`] has been used with the given |
| /// time (or "now" if `None`). |
| /// |
| /// Also implicitly marks the git db used, too. |
| pub fn mark_git_checkout_used_stamp( |
| &mut self, |
| git_checkout: GitCheckout, |
| timestamp: Option<&SystemTime>, |
| ) { |
| let timestamp = timestamp.map_or(self.now, to_timestamp); |
| let db = GitDb { |
| encoded_git_name: git_checkout.encoded_git_name, |
| }; |
| self.git_db_timestamps.insert(db, timestamp); |
| self.git_checkout_timestamps.insert(git_checkout, timestamp); |
| } |
| |
| /// Saves all of the deferred information to the database. |
| /// |
| /// This will also clear the state of `self`. |
| #[tracing::instrument(skip_all)] |
| pub fn save(&mut self, tracker: &mut GlobalCacheTracker) -> CargoResult<()> { |
| trace!(target: "gc", "saving last-use data"); |
| if self.is_empty() { |
| return Ok(()); |
| } |
| let tx = tracker.conn.transaction()?; |
| // These must run before the ones that refer to their IDs. |
| self.insert_registry_index_from_cache(&tx)?; |
| self.insert_git_db_from_cache(&tx)?; |
| self.insert_registry_crate_from_cache(&tx)?; |
| self.insert_registry_src_from_cache(&tx)?; |
| self.insert_git_checkout_from_cache(&tx)?; |
| tx.commit()?; |
| trace!(target: "gc", "last-use save complete"); |
| Ok(()) |
| } |
| |
| /// Variant of [`DeferredGlobalLastUse::save`] that does not return an |
| /// error. |
| /// |
| /// This will log or display a warning to the user. |
| pub fn save_no_error(&mut self, gctx: &GlobalContext) { |
| if let Err(e) = self.save_with_gctx(gctx) { |
| // Because there is an assertion in auto-gc that checks if this is |
| // empty, be sure to clear it so that assertion doesn't fail. |
| self.clear(); |
| if !self.save_err_has_warned { |
| if is_silent_error(&e) && gctx.shell().verbosity() != Verbosity::Verbose { |
| tracing::warn!("failed to save last-use data: {e:?}"); |
| } else { |
| crate::display_warning_with_error( |
| "failed to save last-use data\n\ |
| This may prevent cargo from accurately tracking what is being \ |
| used in its global cache. This information is used for \ |
| automatically removing unused data in the cache.", |
| &e, |
| &mut gctx.shell(), |
| ); |
| self.save_err_has_warned = true; |
| } |
| } |
| } |
| } |
| |
| fn save_with_gctx(&mut self, gctx: &GlobalContext) -> CargoResult<()> { |
| let mut tracker = gctx.global_cache_tracker()?; |
| self.save(&mut tracker) |
| } |
| |
| /// Flushes all of the `registry_index_timestamps` to the database, |
| /// clearing `registry_index_timestamps`. |
| fn insert_registry_index_from_cache(&mut self, conn: &Connection) -> CargoResult<()> { |
| insert_or_update_parent!( |
| self, |
| conn, |
| "registry_index", |
| registry_index_timestamps, |
| registry_keys, |
| encoded_registry_name |
| ); |
| } |
| |
| /// Flushes all of the `git_db_timestamps` to the database, |
| /// clearing `registry_index_timestamps`. |
| fn insert_git_db_from_cache(&mut self, conn: &Connection) -> CargoResult<()> { |
| insert_or_update_parent!( |
| self, |
| conn, |
| "git_db", |
| git_db_timestamps, |
| git_keys, |
| encoded_git_name |
| ); |
| } |
| |
| /// Flushes all of the `registry_crate_timestamps` to the database, |
| /// clearing `registry_index_timestamps`. |
| fn insert_registry_crate_from_cache(&mut self, conn: &Connection) -> CargoResult<()> { |
| let registry_crate_timestamps = std::mem::take(&mut self.registry_crate_timestamps); |
| for (registry_crate, timestamp) in registry_crate_timestamps { |
| trace!(target: "gc", "insert registry crate {registry_crate:?} {timestamp}"); |
| let registry_id = self.registry_id(conn, registry_crate.encoded_registry_name)?; |
| let mut stmt = conn.prepare_cached( |
| "INSERT INTO registry_crate (registry_id, name, size, timestamp) |
| VALUES (?1, ?2, ?3, ?4) |
| ON CONFLICT DO UPDATE SET timestamp=excluded.timestamp |
| WHERE timestamp < ?5 |
| ", |
| )?; |
| stmt.execute(params![ |
| registry_id, |
| registry_crate.crate_filename, |
| registry_crate.size, |
| timestamp, |
| timestamp - UPDATE_RESOLUTION |
| ])?; |
| } |
| Ok(()) |
| } |
| |
| /// Flushes all of the `registry_src_timestamps` to the database, |
| /// clearing `registry_index_timestamps`. |
| fn insert_registry_src_from_cache(&mut self, conn: &Connection) -> CargoResult<()> { |
| let registry_src_timestamps = std::mem::take(&mut self.registry_src_timestamps); |
| for (registry_src, timestamp) in registry_src_timestamps { |
| trace!(target: "gc", "insert registry src {registry_src:?} {timestamp}"); |
| let registry_id = self.registry_id(conn, registry_src.encoded_registry_name)?; |
| let mut stmt = conn.prepare_cached( |
| "INSERT INTO registry_src (registry_id, name, size, timestamp) |
| VALUES (?1, ?2, ?3, ?4) |
| ON CONFLICT DO UPDATE SET timestamp=excluded.timestamp |
| WHERE timestamp < ?5 |
| ", |
| )?; |
| stmt.execute(params![ |
| registry_id, |
| registry_src.package_dir, |
| registry_src.size, |
| timestamp, |
| timestamp - UPDATE_RESOLUTION |
| ])?; |
| } |
| |
| Ok(()) |
| } |
| |
| /// Flushes all of the `git_checkout_timestamps` to the database, |
| /// clearing `registry_index_timestamps`. |
| fn insert_git_checkout_from_cache(&mut self, conn: &Connection) -> CargoResult<()> { |
| let git_checkout_timestamps = std::mem::take(&mut self.git_checkout_timestamps); |
| for (git_checkout, timestamp) in git_checkout_timestamps { |
| let git_id = self.git_id(conn, git_checkout.encoded_git_name)?; |
| let mut stmt = conn.prepare_cached( |
| "INSERT INTO git_checkout (git_id, name, size, timestamp) |
| VALUES (?1, ?2, ?3, ?4) |
| ON CONFLICT DO UPDATE SET timestamp=excluded.timestamp |
| WHERE timestamp < ?5", |
| )?; |
| stmt.execute(params![ |
| git_id, |
| git_checkout.short_name, |
| git_checkout.size, |
| timestamp, |
| timestamp - UPDATE_RESOLUTION |
| ])?; |
| } |
| |
| Ok(()) |
| } |
| |
| /// Returns the numeric ID of the registry, either fetching from the local |
| /// cache, or getting it from the database. |
| /// |
| /// It is an error if the registry does not exist. |
| fn registry_id( |
| &mut self, |
| conn: &Connection, |
| encoded_registry_name: InternedString, |
| ) -> CargoResult<ParentId> { |
| match self.registry_keys.get(&encoded_registry_name) { |
| Some(i) => Ok(*i), |
| None => { |
| let Some(id) = GlobalCacheTracker::id_from_name( |
| conn, |
| REGISTRY_INDEX_TABLE, |
| &encoded_registry_name, |
| )? |
| else { |
| bail!("expected registry_index {encoded_registry_name} to exist, but wasn't found"); |
| }; |
| self.registry_keys.insert(encoded_registry_name, id); |
| Ok(id) |
| } |
| } |
| } |
| |
| /// Returns the numeric ID of the git db, either fetching from the local |
| /// cache, or getting it from the database. |
| /// |
| /// It is an error if the git db does not exist. |
| fn git_id( |
| &mut self, |
| conn: &Connection, |
| encoded_git_name: InternedString, |
| ) -> CargoResult<ParentId> { |
| match self.git_keys.get(&encoded_git_name) { |
| Some(i) => Ok(*i), |
| None => { |
| let Some(id) = |
| GlobalCacheTracker::id_from_name(conn, GIT_DB_TABLE, &encoded_git_name)? |
| else { |
| bail!("expected git_db {encoded_git_name} to exist, but wasn't found") |
| }; |
| self.git_keys.insert(encoded_git_name, id); |
| Ok(id) |
| } |
| } |
| } |
| } |
| |
| /// Converts a [`SystemTime`] to a [`Timestamp`] which can be stored in the database. |
| fn to_timestamp(t: &SystemTime) -> Timestamp { |
| t.duration_since(SystemTime::UNIX_EPOCH) |
| .expect("invalid clock") |
| .as_secs() |
| } |
| |
| /// Returns the current time. |
| /// |
| /// This supports pretending that the time is different for testing using an |
| /// environment variable. |
| /// |
| /// If possible, try to avoid calling this too often since accessing clocks |
| /// can be a little slow on some systems. |
| #[allow(clippy::disallowed_methods)] |
| fn now() -> Timestamp { |
| match std::env::var("__CARGO_TEST_LAST_USE_NOW") { |
| Ok(now) => now.parse().unwrap(), |
| Err(_) => to_timestamp(&SystemTime::now()), |
| } |
| } |
| |
| /// Returns whether or not the given error should cause a warning to be |
| /// displayed to the user. |
| /// |
| /// In some situations, like a read-only global cache, we don't want to spam |
| /// the user with a warning. I think once cargo has controllable lints, I |
| /// think we should consider changing this to always warn, but give the user |
| /// an option to silence the warning. |
| pub fn is_silent_error(e: &anyhow::Error) -> bool { |
| if let Some(e) = e.downcast_ref::<rusqlite::Error>() { |
| if matches!( |
| e.sqlite_error_code(), |
| Some(ErrorCode::CannotOpen | ErrorCode::ReadOnly) |
| ) { |
| return true; |
| } |
| } |
| false |
| } |
| |
| /// Returns the disk usage for a git checkout directory. |
| pub fn du_git_checkout(path: &Path) -> CargoResult<u64> { |
| // !.git is used because clones typically use hardlinks for the git |
| // contents. TODO: Verify behavior on Windows. |
| // TODO: Or even better, switch to worktrees, and remove this. |
| cargo_util::du(&path, &["!.git"]) |
| } |
| |
| fn du(path: &Path, table_name: &str) -> CargoResult<u64> { |
| if table_name == GIT_CO_TABLE { |
| du_git_checkout(path) |
| } else { |
| cargo_util::du(&path, &[]) |
| } |
| } |