# -*- coding: utf-8 -*-
# Copyright 2012 Google Inc. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
"""Additional help about using gsutil for production tasks."""

from __future__ import absolute_import

from gslib.help_provider import HelpProvider


_DETAILED_HELP_TEXT = ("""
<B>OVERVIEW</B>
If you use gsutil in large production tasks (such as uploading or
downloading many GiBs of data each night), there are a number of things
you can do to help ensure success. Specifically, this section discusses
how to script large production tasks around gsutil's resumable transfer
mechanism.

<B>BACKGROUND ON RESUMABLE TRANSFERS</B>
First, it's helpful to understand gsutil's resumable transfer mechanism,
and how your script needs to be implemented around this mechanism to work
reliably. gsutil uses resumable transfer support when you attempt to upload
or download a file larger than a configurable threshold (by default, this
threshold is 2 MiB). When a transfer fails partway through (e.g., because of
an intermittent network problem), gsutil uses a truncated randomized binary
exponential backoff-and-retry strategy that by default will retry transfers up
to 23 times over a 10-minute period (see "gsutil help retries" for
details). If the transfer fails each of these attempts with no intervening
progress, gsutil gives up on the transfer, but keeps a "tracker" file for
it in a configurable location (the default location is ~/.gsutil/, in a file
named by a combination of the SHA1 hash of the name of the bucket and object
being transferred and the last 16 characters of the file name). When transfers
fail in this fashion, you can rerun gsutil at some later time (e.g., after
the networking problem has been resolved), and the resumable transfer picks
up where it left off.
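For example, if a nightly download of a large object fails in this way,
rerunning the same command later resumes from the tracker file instead of
starting over (the bucket and local path here are illustrative):

  gsutil cp gs://your-bucket/nightly/large-file.tgz /data/staging/large-file.tgz
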
<B>SCRIPTING DATA TRANSFER TASKS</B>
To script large production data transfer tasks around this mechanism,
you can implement a script that runs periodically, determines which file
transfers have not yet succeeded, and runs gsutil to copy them. Below,
we offer a number of suggestions about how this type of scripting should
be implemented:
1. When resumable transfers fail without any progress 23 times in a row
over the course of up to 10 minutes, it probably won't work to simply
retry the transfer immediately. A more successful strategy would be to
have a cron job that runs every 30 minutes, determines which transfers
need to be run, and runs them. If the network experiences intermittent
problems, the script picks up where it left off and will eventually
succeed (once the network problem has been resolved).
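For example, a crontab entry along these lines (the script path and log file
are illustrative; the script itself would determine and rerun the outstanding
transfers) runs the check every 30 minutes:

  */30 * * * * /usr/local/bin/run_outstanding_transfers.sh >> /var/log/transfers.log 2>&1
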
2. If your business depends on timely data transfer, you should consider
implementing some network monitoring. For example, you can implement
a task that attempts a small download every few minutes and raises an
alert if several attempts in a row fail (adjust the check frequency and
failure threshold to your requirements), so that your IT staff can
investigate problems promptly. As usual with monitoring implementations,
you should experiment with the alerting thresholds to avoid false
positives that train your staff to ignore the alerts.
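A minimal sketch of such a check (the canary object, alert address, and
mail command are placeholders to adapt to your environment):

  if ! gsutil cp gs://your-bucket/canary-object /tmp/canary; then
    echo "canary download failed at $(date)" | mail -s "transfer alert" oncall@example.com
  fi
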
3. There are a variety of ways you can determine what files remain to be
transferred. We recommend that you avoid attempting to get a complete
listing of a bucket containing many objects (e.g., tens of thousands
or more). One strategy is to structure your object names in a way that
represents your transfer process, and use gsutil prefix wildcards to
request partial bucket listings. For example, if your periodic process
involves downloading the current day's objects, you could name objects
using a year-month-day-object-ID format and then find today's objects by
using a command like gsutil ls "gs://bucket/2011-09-27-*". Note that it
is more efficient to have a non-wildcard prefix like this than to use
something like gsutil ls "gs://bucket/*-2011-09-27". The latter command
actually requests a complete bucket listing and then filters in gsutil,
while the former asks Google Storage to return the subset of objects
whose names start with everything up to the "*".
For data uploads, another technique would be to move local files from a "to
be processed" area to a "done" area as your script successfully copies
files to the cloud. You can do this in parallel batches by using a command
like:
gsutil -m cp -r to_upload/subdir_$i gs://bucket/subdir_$i
where i is a shell loop variable. Make sure to check that the shell exit
status ($? in bash, $status in csh) is 0 after each gsutil cp command, to
detect whether any of the copies failed, and rerun the affected copies (see
the sketch below).
With this strategy, the file system keeps track of all remaining work to
be done.
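A minimal sketch of such a loop, assuming a bash shell and illustrative
directory names:

  for i in 1 2 3 4; do
    if gsutil -m cp -r to_upload/subdir_$i gs://bucket/subdir_$i; then
      # Copy succeeded; move this batch to the "done" area.
      mv to_upload/subdir_$i done/subdir_$i
    else
      # Copy failed; leave the batch in place so the next run retries it.
      echo "subdir_$i failed; will retry on the next run" >&2
    fi
  done
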
4. If you have really large numbers of objects in a single bucket
(say hundreds of thousands or more), you should consider tracking your
objects in a database instead of using bucket listings to enumerate
the objects. For example, the database could track the state of each
download, so your periodic download script can determine which objects
still need to be downloaded by querying the database locally instead of
performing a bucket listing.
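One possible sketch using a local SQLite database to track download state
(the database file, table, and column names are hypothetical; any local
datastore would work):

  sqlite3 transfers.db "SELECT object_name FROM downloads WHERE state != 'done';" |
  while read object_name; do
    if gsutil cp "gs://bucket/$object_name" "/data/incoming/$object_name"; then
      sqlite3 transfers.db \
        "UPDATE downloads SET state = 'done' WHERE object_name = '$object_name';"
    fi
  done
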
5. Make sure you don't delete partially downloaded temporary files after a
transfer fails: gsutil picks up where it left off (and performs a hash
of the final downloaded content to ensure data integrity), so deleting
partially transferred files will cause you to lose that progress and waste
network bandwidth.
6. If you have a fast network connection, you can speed up the transfer of
large numbers of files by using the gsutil -m (multi-threading /
multi-processing) option. Be aware, however, that gsutil doesn't attempt to
keep track of which files were downloaded successfully in cases where some
files failed to download. For example, if you use multi-threaded transfers
to download 100 files and 3 failed to download, it is up to your scripting
process to determine which transfers didn't succeed and retry them. A
periodic check-and-run approach like the one outlined earlier handles this
case.
If you use parallel transfers (gsutil -m) you might want to experiment with
the number of threads being used (via the parallel_thread_count setting
in the .boto config file). By default, gsutil uses 10 threads for Linux
and 24 threads for other operating systems. Depending on your network
speed, available memory, CPU load, and other conditions, this may or may
not be optimal. Experiment with higher and lower thread counts to find the
best value for your environment.
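
For example, to try 16 threads you could set the following in your .boto
configuration file (the value is illustrative; tune it for your environment):

  [GSUtil]
  parallel_thread_count = 16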
""")
class CommandOptions(HelpProvider):
  """Additional help about using gsutil for production tasks."""

  # Help specification. See help_provider.py for documentation.
  help_spec = HelpProvider.HelpSpec(
      help_name='prod',
      help_name_aliases=[
          'production', 'resumable', 'resumable upload', 'resumable transfer',
          'resumable download', 'scripts', 'scripting'],
      help_type='additional_help',
      help_one_line_summary='Scripting Production Transfers',
      help_text=_DETAILED_HELP_TEXT,
      subcommand_help_text={},
  )