| //! A `Source` for registry-based packages. |
| //! |
| //! # What's a Registry? |
| //! |
| //! Registries are central locations where packages can be uploaded to, |
| //! discovered, and searched for. The purpose of a registry is to have a |
| //! location that serves as permanent storage for versions of a crate over time. |
| //! |
| //! Compared to git sources, a registry provides many packages as well as many |
| //! versions simultaneously. Git sources can also have commits deleted through |
| //! rebasings where registries cannot have their versions deleted. |
| //! |
| //! # The Index of a Registry |
| //! |
| //! One of the major difficulties with a registry is that hosting so many |
| //! packages may quickly run into performance problems when dealing with |
| //! dependency graphs. It's infeasible for cargo to download the entire contents |
| //! of the registry just to resolve one package's dependencies, for example. As |
| //! a result, cargo needs some efficient method of querying what packages are |
| //! available on a registry, what versions are available, and what the |
| //! dependencies for each version is. |
| //! |
| //! One method of doing so would be having the registry expose an HTTP endpoint |
| //! which can be queried with a list of packages and a response of their |
| //! dependencies and versions is returned. This is somewhat inefficient however |
| //! as we may have to hit the endpoint many times and we may have already |
| //! queried for much of the data locally already (for other packages, for |
| //! example). This also involves inventing a transport format between the |
| //! registry and Cargo itself, so this route was not taken. |
| //! |
| //! Instead, Cargo communicates with registries through a git repository |
| //! referred to as the Index. The Index of a registry is essentially an easily |
| //! query-able version of the registry's database for a list of versions of a |
| //! package as well as a list of dependencies for each version. |
| //! |
| //! Using git to host this index provides a number of benefits: |
| //! |
| //! * The entire index can be stored efficiently locally on disk. This means |
| //! that all queries of a registry can happen locally and don't need to touch |
| //! the network. |
| //! |
| //! * Updates of the index are quite efficient. Using git buys incremental |
| //! updates, compressed transmission, etc for free. The index must be updated |
| //! each time we need fresh information from a registry, but this is one |
| //! update of a git repository that probably hasn't changed a whole lot so |
| //! it shouldn't be too expensive. |
| //! |
| //! Additionally, each modification to the index is just appending a line at |
| //! the end of a file (the exact format is described later). This means that |
| //! the commits for an index are quite small and easily applied/compressable. |
| //! |
| //! ## The format of the Index |
| //! |
| //! The index is a store for the list of versions for all packages known, so its |
| //! format on disk is optimized slightly to ensure that `ls registry` doesn't |
| //! produce a list of all packages ever known. The index also wants to ensure |
| //! that there's not a million files which may actually end up hitting |
| //! filesystem limits at some point. To this end, a few decisions were made |
| //! about the format of the registry: |
| //! |
| //! 1. Each crate will have one file corresponding to it. Each version for a |
| //! crate will just be a line in this file. |
| //! 2. There will be two tiers of directories for crate names, under which |
| //! crates corresponding to those tiers will be located. |
| //! |
| //! As an example, this is an example hierarchy of an index: |
| //! |
| //! ```notrust |
| //! . |
| //! ├── 3 |
| //! │ └── u |
| //! │ └── url |
| //! ├── bz |
| //! │ └── ip |
| //! │ └── bzip2 |
| //! ├── config.json |
| //! ├── en |
| //! │ └── co |
| //! │ └── encoding |
| //! └── li |
| //! ├── bg |
| //! │ └── libgit2 |
| //! └── nk |
| //! └── link-config |
| //! ``` |
| //! |
| //! The root of the index contains a `config.json` file with a few entries |
| //! corresponding to the registry (see `RegistryConfig` below). |
| //! |
| //! Otherwise, there are three numbered directories (1, 2, 3) for crates with |
| //! names 1, 2, and 3 characters in length. The 1/2 directories simply have the |
| //! crate files underneath them, while the 3 directory is sharded by the first |
| //! letter of the crate name. |
| //! |
| //! Otherwise the top-level directory contains many two-letter directory names, |
| //! each of which has many sub-folders with two letters. At the end of all these |
| //! are the actual crate files themselves. |
| //! |
| //! The purpose of this layout is to hopefully cut down on `ls` sizes as well as |
| //! efficient lookup based on the crate name itself. |
| //! |
| //! ## Crate files |
| //! |
| //! Each file in the index is the history of one crate over time. Each line in |
| //! the file corresponds to one version of a crate, stored in JSON format (see |
| //! the `RegistryPackage` structure below). |
| //! |
| //! As new versions are published, new lines are appended to this file. The only |
| //! modifications to this file that should happen over time are yanks of a |
| //! particular version. |
| //! |
| //! # Downloading Packages |
| //! |
| //! The purpose of the Index was to provide an efficient method to resolve the |
| //! dependency graph for a package. So far we only required one network |
| //! interaction to update the registry's repository (yay!). After resolution has |
| //! been performed, however we need to download the contents of packages so we |
| //! can read the full manifest and build the source code. |
| //! |
| //! To accomplish this, this source's `download` method will make an HTTP |
| //! request per-package requested to download tarballs into a local cache. These |
| //! tarballs will then be unpacked into a destination folder. |
| //! |
| //! Note that because versions uploaded to the registry are frozen forever that |
| //! the HTTP download and unpacking can all be skipped if the version has |
| //! already been downloaded and unpacked. This caching allows us to only |
| //! download a package when absolutely necessary. |
| //! |
| //! # Filesystem Hierarchy |
| //! |
| //! Overall, the `$HOME/.cargo` looks like this when talking about the registry: |
| //! |
| //! ```notrust |
| //! # A folder under which all registry metadata is hosted (similar to |
| //! # $HOME/.cargo/git) |
| //! $HOME/.cargo/registry/ |
| //! |
| //! # For each registry that cargo knows about (keyed by hostname + hash) |
| //! # there is a folder which is the checked out version of the index for |
| //! # the registry in this location. Note that this is done so cargo can |
| //! # support multiple registries simultaneously |
| //! index/ |
| //! registry1-<hash>/ |
| //! registry2-<hash>/ |
| //! ... |
| //! |
| //! # This folder is a cache for all downloaded tarballs from a registry. |
| //! # Once downloaded and verified, a tarball never changes. |
| //! cache/ |
| //! registry1-<hash>/<pkg>-<version>.crate |
| //! ... |
| //! |
| //! # Location in which all tarballs are unpacked. Each tarball is known to |
| //! # be frozen after downloading, so transitively this folder is also |
| //! # frozen once its unpacked (it's never unpacked again) |
| //! src/ |
| //! registry1-<hash>/<pkg>-<version>/... |
| //! ... |
| //! ``` |
| |
| use std::borrow::Cow; |
| use std::collections::BTreeMap; |
| use std::collections::HashSet; |
| use std::io::Write; |
| use std::path::{Path, PathBuf}; |
| |
| use flate2::read::GzDecoder; |
| use log::debug; |
| use semver::Version; |
| use serde::Deserialize; |
| use tar::Archive; |
| |
| use crate::core::dependency::{Dependency, Kind}; |
| use crate::core::source::MaybePackage; |
| use crate::core::{Package, PackageId, Source, SourceId, Summary}; |
| use crate::sources::PathSource; |
| use crate::util::errors::CargoResultExt; |
| use crate::util::hex; |
| use crate::util::to_url::ToUrl; |
| use crate::util::{internal, CargoResult, Config, FileLock, Filesystem}; |
| |
| const INDEX_LOCK: &str = ".cargo-index-lock"; |
| const PACKAGE_SOURCE_LOCK: &str = ".cargo-ok"; |
| pub const CRATES_IO_INDEX: &str = "https://github.com/rust-lang/crates.io-index"; |
| pub const CRATES_IO_REGISTRY: &str = "crates-io"; |
| const CRATE_TEMPLATE: &str = "{crate}"; |
| const VERSION_TEMPLATE: &str = "{version}"; |
| |
| pub struct RegistrySource<'cfg> { |
| source_id: SourceId, |
| src_path: Filesystem, |
| config: &'cfg Config, |
| updated: bool, |
| ops: Box<dyn RegistryData + 'cfg>, |
| index: index::RegistryIndex<'cfg>, |
| yanked_whitelist: HashSet<PackageId>, |
| index_locked: bool, |
| } |
| |
| #[derive(Deserialize)] |
| pub struct RegistryConfig { |
| /// Download endpoint for all crates. |
| /// |
| /// The string is a template which will generate the download URL for the |
| /// tarball of a specific version of a crate. The substrings `{crate}` and |
| /// `{version}` will be replaced with the crate's name and version |
| /// respectively. |
| /// |
| /// For backwards compatibility, if the string does not contain `{crate}` or |
| /// `{version}`, it will be extended with `/{crate}/{version}/download` to |
| /// support registries like crates.io which were crated before the |
| /// templating setup was created. |
| pub dl: String, |
| |
| /// API endpoint for the registry. This is what's actually hit to perform |
| /// operations like yanks, owner modifications, publish new crates, etc. |
| /// If this is None, the registry does not support API commands. |
| pub api: Option<String>, |
| } |
| |
| #[derive(Deserialize)] |
| pub struct RegistryPackage<'a> { |
| name: Cow<'a, str>, |
| vers: Version, |
| deps: Vec<RegistryDependency<'a>>, |
| features: BTreeMap<Cow<'a, str>, Vec<Cow<'a, str>>>, |
| cksum: String, |
| yanked: Option<bool>, |
| links: Option<Cow<'a, str>>, |
| } |
| |
| #[test] |
| fn escaped_cher_in_json() { |
| let _: RegistryPackage<'_> = serde_json::from_str( |
| r#"{"name":"a","vers":"0.0.1","deps":[],"cksum":"bae3","features":{}}"#, |
| ) |
| .unwrap(); |
| let _: RegistryPackage<'_> = serde_json::from_str( |
| r#"{"name":"a","vers":"0.0.1","deps":[],"cksum":"bae3","features":{"test":["k","q"]},"links":"a-sys"}"# |
| ).unwrap(); |
| |
| // Now we add escaped cher all the places they can go |
| // these are not valid, but it should error later than json parsing |
| let _: RegistryPackage<'_> = serde_json::from_str( |
| r#"{ |
| "name":"This name has a escaped cher in it \n\t\" ", |
| "vers":"0.0.1", |
| "deps":[{ |
| "name": " \n\t\" ", |
| "req": " \n\t\" ", |
| "features": [" \n\t\" "], |
| "optional": true, |
| "default_features": true, |
| "target": " \n\t\" ", |
| "kind": " \n\t\" ", |
| "registry": " \n\t\" " |
| }], |
| "cksum":"bae3", |
| "features":{"test \n\t\" ":["k \n\t\" ","q \n\t\" "]}, |
| "links":" \n\t\" "}"#, |
| ) |
| .unwrap(); |
| } |
| |
| #[derive(Deserialize)] |
| #[serde(field_identifier, rename_all = "lowercase")] |
| enum Field { |
| Name, |
| Vers, |
| Deps, |
| Features, |
| Cksum, |
| Yanked, |
| Links, |
| } |
| |
| #[derive(Deserialize)] |
| struct RegistryDependency<'a> { |
| name: Cow<'a, str>, |
| req: Cow<'a, str>, |
| features: Vec<Cow<'a, str>>, |
| optional: bool, |
| default_features: bool, |
| target: Option<Cow<'a, str>>, |
| kind: Option<Cow<'a, str>>, |
| registry: Option<Cow<'a, str>>, |
| package: Option<Cow<'a, str>>, |
| } |
| |
| impl<'a> RegistryDependency<'a> { |
| /// Converts an encoded dependency in the registry to a cargo dependency |
| pub fn into_dep(self, default: SourceId) -> CargoResult<Dependency> { |
| let RegistryDependency { |
| name, |
| req, |
| mut features, |
| optional, |
| default_features, |
| target, |
| kind, |
| registry, |
| package, |
| } = self; |
| |
| let id = if let Some(registry) = ®istry { |
| SourceId::for_registry(®istry.to_url()?)? |
| } else { |
| default |
| }; |
| |
| let mut dep = |
| Dependency::parse_no_deprecated(package.as_ref().unwrap_or(&name), Some(&req), id)?; |
| if package.is_some() { |
| dep.set_explicit_name_in_toml(&name); |
| } |
| let kind = match kind.as_ref().map(|s| &s[..]).unwrap_or("") { |
| "dev" => Kind::Development, |
| "build" => Kind::Build, |
| _ => Kind::Normal, |
| }; |
| |
| let platform = match target { |
| Some(target) => Some(target.parse()?), |
| None => None, |
| }; |
| |
| // Unfortunately older versions of cargo and/or the registry ended up |
| // publishing lots of entries where the features array contained the |
| // empty feature, "", inside. This confuses the resolution process much |
| // later on and these features aren't actually valid, so filter them all |
| // out here. |
| features.retain(|s| !s.is_empty()); |
| |
| // In index, "registry" is null if it is from the same index. |
| // In Cargo.toml, "registry" is None if it is from the default |
| if !id.is_default_registry() { |
| dep.set_registry_id(id); |
| } |
| |
| dep.set_optional(optional) |
| .set_default_features(default_features) |
| .set_features(features) |
| .set_platform(platform) |
| .set_kind(kind); |
| |
| Ok(dep) |
| } |
| } |
| |
| pub trait RegistryData { |
| fn prepare(&self) -> CargoResult<()>; |
| fn index_path(&self) -> &Filesystem; |
| fn load( |
| &self, |
| _root: &Path, |
| path: &Path, |
| data: &mut dyn FnMut(&[u8]) -> CargoResult<()>, |
| ) -> CargoResult<()>; |
| fn config(&mut self) -> CargoResult<Option<RegistryConfig>>; |
| fn update_index(&mut self) -> CargoResult<()>; |
| fn download(&mut self, pkg: PackageId, checksum: &str) -> CargoResult<MaybeLock>; |
| fn finish_download( |
| &mut self, |
| pkg: PackageId, |
| checksum: &str, |
| data: &[u8], |
| ) -> CargoResult<FileLock>; |
| |
| fn is_crate_downloaded(&self, _pkg: PackageId) -> bool { |
| true |
| } |
| } |
| |
| pub enum MaybeLock { |
| Ready(FileLock), |
| Download { url: String, descriptor: String }, |
| } |
| |
| mod index; |
| mod local; |
| mod remote; |
| |
| fn short_name(id: SourceId) -> String { |
| let hash = hex::short_hash(&id); |
| let ident = id.url().host_str().unwrap_or("").to_string(); |
| format!("{}-{}", ident, hash) |
| } |
| |
| impl<'cfg> RegistrySource<'cfg> { |
| pub fn remote( |
| source_id: SourceId, |
| yanked_whitelist: &HashSet<PackageId>, |
| config: &'cfg Config, |
| ) -> RegistrySource<'cfg> { |
| let name = short_name(source_id); |
| let ops = remote::RemoteRegistry::new(source_id, config, &name); |
| RegistrySource::new( |
| source_id, |
| config, |
| &name, |
| Box::new(ops), |
| yanked_whitelist, |
| true, |
| ) |
| } |
| |
| pub fn local( |
| source_id: SourceId, |
| path: &Path, |
| yanked_whitelist: &HashSet<PackageId>, |
| config: &'cfg Config, |
| ) -> RegistrySource<'cfg> { |
| let name = short_name(source_id); |
| let ops = local::LocalRegistry::new(path, config, &name); |
| RegistrySource::new( |
| source_id, |
| config, |
| &name, |
| Box::new(ops), |
| yanked_whitelist, |
| false, |
| ) |
| } |
| |
| fn new( |
| source_id: SourceId, |
| config: &'cfg Config, |
| name: &str, |
| ops: Box<dyn RegistryData + 'cfg>, |
| yanked_whitelist: &HashSet<PackageId>, |
| index_locked: bool, |
| ) -> RegistrySource<'cfg> { |
| RegistrySource { |
| src_path: config.registry_source_path().join(name), |
| config, |
| source_id, |
| updated: false, |
| index: index::RegistryIndex::new(source_id, ops.index_path(), config, index_locked), |
| yanked_whitelist: yanked_whitelist.clone(), |
| index_locked, |
| ops, |
| } |
| } |
| |
| /// Decode the configuration stored within the registry. |
| /// |
| /// This requires that the index has been at least checked out. |
| pub fn config(&mut self) -> CargoResult<Option<RegistryConfig>> { |
| self.ops.config() |
| } |
| |
| /// Unpacks a downloaded package into a location where it's ready to be |
| /// compiled. |
| /// |
| /// No action is taken if the source looks like it's already unpacked. |
| fn unpack_package(&self, pkg: PackageId, tarball: &FileLock) -> CargoResult<PathBuf> { |
| // The `.cargo-ok` file is used to track if the source is already |
| // unpacked and to lock the directory for unpacking. |
| let mut ok = { |
| let package_dir = format!("{}-{}", pkg.name(), pkg.version()); |
| let dst = self.src_path.join(&package_dir); |
| dst.create_dir()?; |
| |
| // Attempt to open a read-only copy first to avoid an exclusive write |
| // lock and also work with read-only filesystems. If the file has |
| // any data, assume the source is already unpacked. |
| if let Ok(ok) = dst.open_ro(PACKAGE_SOURCE_LOCK, self.config, &package_dir) { |
| let meta = ok.file().metadata()?; |
| if meta.len() > 0 { |
| let unpack_dir = ok.parent().to_path_buf(); |
| return Ok(unpack_dir); |
| } |
| } |
| |
| dst.open_rw(PACKAGE_SOURCE_LOCK, self.config, &package_dir)? |
| }; |
| let unpack_dir = ok.parent().to_path_buf(); |
| |
| // If the file has any data, assume the source is already unpacked. |
| let meta = ok.file().metadata()?; |
| if meta.len() > 0 { |
| return Ok(unpack_dir); |
| } |
| |
| let gz = GzDecoder::new(tarball.file()); |
| let mut tar = Archive::new(gz); |
| let prefix = unpack_dir.file_name().unwrap(); |
| let parent = unpack_dir.parent().unwrap(); |
| for entry in tar.entries()? { |
| let mut entry = entry.chain_err(|| "failed to iterate over archive")?; |
| let entry_path = entry |
| .path() |
| .chain_err(|| "failed to read entry path")? |
| .into_owned(); |
| |
| // We're going to unpack this tarball into the global source |
| // directory, but we want to make sure that it doesn't accidentally |
| // (or maliciously) overwrite source code from other crates. Cargo |
| // itself should never generate a tarball that hits this error, and |
| // crates.io should also block uploads with these sorts of tarballs, |
| // but be extra sure by adding a check here as well. |
| if !entry_path.starts_with(prefix) { |
| failure::bail!( |
| "invalid tarball downloaded, contains \ |
| a file at {:?} which isn't under {:?}", |
| entry_path, |
| prefix |
| ) |
| } |
| |
| // Once that's verified, unpack the entry as usual. |
| entry |
| .unpack_in(parent) |
| .chain_err(|| format!("failed to unpack entry at `{}`", entry_path.display()))?; |
| } |
| |
| // Write to the lock file to indicate that unpacking was successful. |
| write!(ok, "ok")?; |
| |
| Ok(unpack_dir) |
| } |
| |
| fn do_update(&mut self) -> CargoResult<()> { |
| self.ops.update_index()?; |
| let path = self.ops.index_path(); |
| self.index = |
| index::RegistryIndex::new(self.source_id, path, self.config, self.index_locked); |
| Ok(()) |
| } |
| |
| fn get_pkg(&mut self, package: PackageId, path: &FileLock) -> CargoResult<Package> { |
| let path = self |
| .unpack_package(package, path) |
| .chain_err(|| internal(format!("failed to unpack package `{}`", package)))?; |
| let mut src = PathSource::new(&path, self.source_id, self.config); |
| src.update()?; |
| let pkg = match src.download(package)? { |
| MaybePackage::Ready(pkg) => pkg, |
| MaybePackage::Download { .. } => unreachable!(), |
| }; |
| Ok(pkg) |
| } |
| } |
| |
| impl<'cfg> Source for RegistrySource<'cfg> { |
| fn query(&mut self, dep: &Dependency, f: &mut dyn FnMut(Summary)) -> CargoResult<()> { |
| // If this is a precise dependency, then it came from a lock file and in |
| // theory the registry is known to contain this version. If, however, we |
| // come back with no summaries, then our registry may need to be |
| // updated, so we fall back to performing a lazy update. |
| if dep.source_id().precise().is_some() && !self.updated { |
| debug!("attempting query without update"); |
| let mut called = false; |
| self.index |
| .query_inner(dep, &mut *self.ops, &self.yanked_whitelist, &mut |s| { |
| if dep.matches(&s) { |
| called = true; |
| f(s); |
| } |
| })?; |
| if called { |
| return Ok(()); |
| } else { |
| debug!("falling back to an update"); |
| self.do_update()?; |
| } |
| } |
| |
| self.index |
| .query_inner(dep, &mut *self.ops, &self.yanked_whitelist, &mut |s| { |
| if dep.matches(&s) { |
| f(s); |
| } |
| }) |
| } |
| |
| fn fuzzy_query(&mut self, dep: &Dependency, f: &mut dyn FnMut(Summary)) -> CargoResult<()> { |
| self.index |
| .query_inner(dep, &mut *self.ops, &self.yanked_whitelist, f) |
| } |
| |
| fn supports_checksums(&self) -> bool { |
| true |
| } |
| |
| fn requires_precise(&self) -> bool { |
| false |
| } |
| |
| fn source_id(&self) -> SourceId { |
| self.source_id |
| } |
| |
| fn update(&mut self) -> CargoResult<()> { |
| // If we have an imprecise version then we don't know what we're going |
| // to look for, so we always attempt to perform an update here. |
| // |
| // If we have a precise version, then we'll update lazily during the |
| // querying phase. Note that precise in this case is only |
| // `Some("locked")` as other `Some` values indicate a `cargo update |
| // --precise` request |
| if self.source_id.precise() != Some("locked") { |
| self.do_update()?; |
| } else { |
| debug!("skipping update due to locked registry"); |
| } |
| Ok(()) |
| } |
| |
| fn download(&mut self, package: PackageId) -> CargoResult<MaybePackage> { |
| let hash = self.index.hash(package, &mut *self.ops)?; |
| match self.ops.download(package, &hash)? { |
| MaybeLock::Ready(file) => self.get_pkg(package, &file).map(MaybePackage::Ready), |
| MaybeLock::Download { url, descriptor } => { |
| Ok(MaybePackage::Download { url, descriptor }) |
| } |
| } |
| } |
| |
| fn finish_download(&mut self, package: PackageId, data: Vec<u8>) -> CargoResult<Package> { |
| let hash = self.index.hash(package, &mut *self.ops)?; |
| let file = self.ops.finish_download(package, &hash, &data)?; |
| self.get_pkg(package, &file) |
| } |
| |
| fn fingerprint(&self, pkg: &Package) -> CargoResult<String> { |
| Ok(pkg.package_id().version().to_string()) |
| } |
| |
| fn describe(&self) -> String { |
| self.source_id.display_registry() |
| } |
| |
| fn add_to_yanked_whitelist(&mut self, pkgs: &[PackageId]) { |
| self.yanked_whitelist.extend(pkgs); |
| } |
| } |