| # -*- coding: utf-8 -*- |
| # Copyright 2014 Google Inc. All Rights Reserved. |
| # |
| # Licensed under the Apache License, Version 2.0 (the "License"); |
| # you may not use this file except in compliance with the License. |
| # You may obtain a copy of the License at |
| # |
| # http://www.apache.org/licenses/LICENSE-2.0 |
| # |
| # Unless required by applicable law or agreed to in writing, software |
| # distributed under the License is distributed on an "AS IS" BASIS, |
| # WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. |
| # See the License for the specific language governing permissions and |
| # limitations under the License. |
| """Additional help about CRC32C and installing crcmod.""" |
| |
| from __future__ import absolute_import |
| |
| from gslib.help_provider import HelpProvider |
| |
| _DETAILED_HELP_TEXT = (""" |
| <B>OVERVIEW</B> |
| To minimize the chance for `filename encoding interoperability problems |
| <https://en.wikipedia.org/wiki/Filename#Encoding_indication_interoperability>`_ |
| gsutil requires use of the `UTF-8 <https://en.wikipedia.org/wiki/UTF-8>`_ |
| character encoding when uploading and downloading files. Because UTF-8 is in |
| widespread (and growing) use, for most users nothing needs to be done to use |
| UTF-8. Users with files stored in other encodings (such as |
| `Latin 1 <https://en.wikipedia.org/wiki/ISO/IEC_8859-1>`_) must convert those |
| filenames to UTF-8 before attempting to upload the files. |
| |
| The most common place where users who have filenames that use some other |
| encoding encounter a gsutil error is while uploading files using the recursive |
| (-R) option on the gsutil cp , mv, or rsync commands. When this happens you'll |
| get an error like this: |
| |
| CommandException: Invalid Unicode path encountered |
| ('dir1/dir2/file_name_with_\\xf6n_bad_chars'). |
| gsutil cannot proceed with such files present. |
| Please remove or rename this file and try again. |
| |
| Note that the invalid Unicode characters have been hex-encoded in this error |
| message because otherwise trying to print them would result in another |
| error. |
| |
| If you encounter such an error you can either remove the problematic file(s) |
| or try to rename them and re-run the command. If you have a modest number of |
| such files the simplest thing to do is to think of a different name for the |
| file and manually rename the file (using local filesystem tools). If you have |
| too many files for that to be practical you can use a tool to convert the old |
| character encoding to UTF-8. One such tool is `native2ascii |
| <http://docs.oracle.com/javase/7/docs/technotes/tools/solaris/native2ascii.html>`_. |
| |
| Unicode errors for valid Unicode filepaths can be caused by lack of Python |
| locale configuration on Linux and Mac OSes. If your file paths are Unicode |
| and you get encoding errors, ensure the LANG environment variable is set |
| correctly. Typically, the LANG variable should be set to something like |
| "en_US.UTF-8" or "de_DE.UTF-8". |
| |
| Note also that there's no restriction on the character encoding used in file |
| content - it can be UTF-8, a different encoding, or non-character |
| data (like audio or video content). The gsutil UTF-8 character encoding |
| requirement applies only to filenames. |
| |
| |
| <B>USING UNICODE FILENAMES ON WINDOWS</B> |
| Windows support for Unicode in the command shell (cmd.exe or powershell) is |
| somewhat painful, because Windows uses a Windows-specific character encoding |
| called `cp1252 <https://en.wikipedia.org/wiki/Windows-1252>`_. To use Unicode |
| characters you need to run this command in the command shell before the first |
| time you use gsutil in that shell: |
| |
| chcp 65001 |
| |
| If you neglect to do this before using gsutil, the progress messages while |
| uploading files with Unicode names or listing buckets with Unicode object |
| names will look garbled (i.e., with different glyphs than you expect in the |
| output). If you simply run the chcp command and re-run the gsutil command, the |
| output should no longer look garbled. |
| |
| gsutil attempts to translate between cp1252 encoding and UTF-8 in the main |
| places that Unicode encoding/decoding problems have been encountered to date |
| (traversing the local file system while uploading files, and printing Unicode |
| names while listing buckets). However, because gsutil must perform |
| translation, it is likely there are other erroneous edge cases when using |
| Windows with Unicode. If you encounter problems, you might consider instead |
| using cygwin (on Windows) or Linux or MacOS - all of which support Unicode. |
| |
| |
| <B>CROSS-PLATFORM ENCODING PROBLEMS OF WHICH TO BE AWARE</B> |
| Using UTF-8 for all object names and filenames will ensure that gsutil doesn't |
| encounter character encoding errors while operating on the files. |
| Unfortunately, it's still possible that files uploaded / downloaded this way |
| can have interoperability problems, for a number of reasons unrelated to |
| gsutil. For example: |
| |
| - Windows filenames are case-insensitive, while Google Cloud Storage, Linux, |
| and MacOS are not. Thus, for example, if you have two filenames on Linux |
| differing only in case and upload both to Google Cloud Storage and then |
| subsequently download them to Windows, you will end up with just one file |
| whose contents came from the last of these files to be written to the |
| filesystem. |
| - Mac OS performs character encoding decomposition based on tables stored in |
| the OS, and the tables change between Unicode versions. Thus the encoding |
| used by an external library may not match that performed by the OS. It is |
| possible that two object names may translate to a single local filename. |
| - Windows console support for Unicode is difficult to use correctly. |
| |
| For a more thorough list of such issues see `this presentation |
| <http://www.i18nguy.com/unicode/filename-issues-iuc33.pdf>`_ |
| |
| These problems mostly arise when sharing data across platforms (e.g., |
| uploading data from a Windows machine to Google Cloud Storage, and then |
| downloading from Google Cloud Storage to a machine running MacOS). |
| Unfortunately these problems are a consequence of the lack of a filename |
| encoding standard, and users need to be aware of the kinds of problems that |
| can arise when copying filenames across platforms. |
| |
| There is one precaution users can exercise to prevent some of these problems: |
| When using the Windows console specify wildcards or folders (using the -R |
| option) rather than explicitly named individual files. |
| |
| |
| <B>CONVERTING FILENAMES TO UNICODE</B> |
| Open-source tools are available to convert filenames for non-Unicode files. |
| For example, to convert from latin1 (a common Windows encoding) to Unicode, |
| you can use |
| `Windows iconv <http://gnuwin32.sourceforge.net/packages/libiconv.htm>`_. |
| For Unix-based systems, you can use |
| `libiconv <https://www.gnu.org/software/libiconv/>`_. |
| """) |
| |
| |
| class CommandOptions(HelpProvider): |
| """Additional help about filename encoding and interoperability problems.""" |
| |
| # Help specification. See help_provider.py for documentation. |
| help_spec = HelpProvider.HelpSpec( |
| help_name='encoding', |
| help_name_aliases=['encodings', 'utf8', 'utf-8', 'latin1', 'unicode', |
| 'interoperability'], |
| help_type='additional_help', |
| help_one_line_summary='Filename encoding and interoperability problems', |
| help_text=_DETAILED_HELP_TEXT, |
| subcommand_help_text={}, |
| ) |