
 <?xml version="1.0"?>
 <!--

    Licensed to the Apache Software Foundation (ASF) under one or more
    contributor license agreements.  See the NOTICE file distributed with
    this work for additional information regarding copyright ownership.
    The ASF licenses this file to You under the Apache License, Version 2.0
    (the "License"); you may not use this file except in compliance with
    the License.  You may obtain a copy of the License at

        http://www.apache.org/licenses/LICENSE-2.0

    Unless required by applicable law or agreed to in writing, software
    distributed under the License is distributed on an "AS IS" BASIS,
    WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
    See the License for the specific language governing permissions and
    limitations under the License.

 -->
 <document>
   <properties>
     <title>Commons Compress ZIP package</title>
     <author email="dev@commons.apache.org">Commons Documentation Team</author>
   </properties>
   <body>
     <section name="The ZIP package">

       <p>The ZIP package provides features not found
         in <code>java.util.zip</code>:</p>

       <ul>
         <li>Support for encodings other than UTF-8 for filenames and
           comments.  Starting with Java 7 this is supported
           by <code>java.util.zip</code> as well.</li>
         <li>Access to internal and external attributes (which are used
           to store Unix permissions by some zip implementations).</li>
         <li>Structured support for extra fields.</li>
       </ul>

       <p>In addition to the information stored
         in <code>ArchiveEntry</code>, a <code>ZipArchiveEntry</code>
         stores internal and external attributes as well as extra
         fields.  These may contain information such as Unix
         permissions, the platform the entry was created on, its
         last modification time and an optional comment.</p>
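
       <p>As an illustration of how Unix permissions can travel in the
         external attributes, the following is a minimal sketch that
         assumes the common convention of placing the Unix mode in the
         upper 16 bits of the 32-bit value.  The class and method
         names are hypothetical and are not part of Commons Compress;
         the convention only applies to entries whose creating
         platform is Unix.</p>

<source><![CDATA[
```java
/** Sketch only: Unix mode kept in the upper 16 bits of the external attributes. */
final class UnixModeSketch {

    /** Packs a Unix mode such as 0100644 into a 32-bit external attributes value. */
    static long toExternalAttributes(int unixMode) {
        return ((long) (unixMode & 0xFFFF)) << 16;
    }

    /** Extracts the Unix mode again from a 32-bit external attributes value. */
    static int toUnixMode(long externalAttributes) {
        return (int) ((externalAttributes >> 16) & 0xFFFF);
    }
}
```
]]></source>

       <p>A regular file with permissions <code>rw-r--r--</code> has
         the mode <code>0100644</code> (octal), which this sketch
         turns into the external attributes value
         <code>0x81A40000</code>.</p>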

       <subsection name="ZipArchiveInputStream vs ZipFile">

         <p>ZIP archives store archive entries in sequence and
           contain a registry of all entries at the very end of the
           archive.  It is acceptable for an archive to contain several
           entries of the same name and have the registry (called the
           central directory) decide which entry is actually to be used
           (if any).</p>

         <p>In addition the ZIP format stores certain information only
           inside the central directory and not together with the entry
           itself.  This includes:</p>

         <ul>
           <li>internal and external attributes</li>
           <li>different or additional extra fields</li>
         </ul>

         <p>This means the ZIP format cannot really be parsed
           correctly while reading a non-seekable stream, which is what
           <code>ZipArchiveInputStream</code> is forced to do.  As a
           result <code>ZipArchiveInputStream</code></p>
         <ul>
           <li>may return entries that are not part of the central
             directory at all and shouldn't be considered part of the
             archive.</li>
           <li>may return several entries with the same name.</li>
           <li>will not return internal or external attributes.</li>
           <li>may return incomplete extra field data.</li>
           <li>may return unknown sizes and CRC values for entries
           until the next entry has been reached if the archive uses
           the data descriptor feature (see below).</li>
         </ul>

         <p><code>ZipArchiveInputStream</code> shares these limitations
           with <code>java.util.zip.ZipInputStream</code>.</p>

         <p><code>ZipFile</code> is able to read the central directory
           first and provide correct and complete information on any
           ZIP archive.</p>

         <p>The ZIP format has a feature called the data descriptor,
           which is a way to store an entry's length after the entry's
           data.  This can only work reliably if the size information
           can be taken from the central directory or the data itself
           can signal that it is complete, which is true for data that
           is compressed using the DEFLATED compression algorithm.</p>
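
         <p>The effect of the data descriptor on a streaming reader can
           be observed with the JDK's own <code>java.util.zip</code>
           classes, which are subject to the same limitation.  The
           following is a minimal sketch, not Commons Compress code;
           the class and method names are hypothetical.  It writes
           one DEFLATED entry without a known size and records what
           the stream reports before and after the entry's data has
           been read.</p>

<source><![CDATA[
```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.util.zip.ZipEntry;
import java.util.zip.ZipInputStream;
import java.util.zip.ZipOutputStream;

final class DataDescriptorSketch {

    /**
     * Writes one DEFLATED entry of unknown size to an in-memory archive and
     * returns the sizes a streaming reader reports before and after it has
     * consumed the entry's data.
     */
    static long[] sizesSeenWhileStreaming(byte[] payload) {
        try {
            ByteArrayOutputStream archive = new ByteArrayOutputStream();
            try (ZipOutputStream out = new ZipOutputStream(archive)) {
                // No size or CRC is set on the entry, so they can only be
                // recorded after the data, in a data descriptor.
                out.putNextEntry(new ZipEntry("hello.txt"));
                out.write(payload);
                out.closeEntry();
            }
            try (ZipInputStream in = new ZipInputStream(
                    new ByteArrayInputStream(archive.toByteArray()))) {
                ZipEntry entry = in.getNextEntry();
                long before = entry.getSize();
                byte[] buffer = new byte[4096];
                while (in.read(buffer) != -1) {
                    // drain the entry; the content itself is not needed here
                }
                long after = entry.getSize();
                return new long[] {before, after};
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```
]]></source>

         <p>The first value is expected to be <code>-1</code>
           (unknown), while the second one matches the payload length
           once the end of the entry has been reached.</p>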

         <p><code>ZipFile</code> has access to the central directory
           and can extract entries using the data descriptor reliably.
           The same is true for <code>ZipArchiveInputStream</code> as
           long as the entry is DEFLATED.  For STORED
           entries <code>ZipArchiveInputStream</code> can try to read
           ahead until it finds the next entry, but this approach is
           not safe and has to be enabled by a constructor argument
           explicitly.</p>

         <p>If possible, you should always prefer <code>ZipFile</code>
           over <code>ZipArchiveInputStream</code>.</p>

         <p><code>ZipFile</code> requires a
         <code>SeekableByteChannel</code> that will be obtained
         transparently when reading from a file. The class
         <code>org.apache.commons.compress.utils.SeekableInMemoryByteChannel</code>
         allows you to read from an in-memory archive.</p>

       </subsection>

       <subsection name="ZipArchiveOutputStream" id="ZipArchiveOutputStream">
         <p><code>ZipArchiveOutputStream</code> has three constructors:
         one takes a <code>File</code> argument, one a
         <code>SeekableByteChannel</code> and the last an
         <code>OutputStream</code>.  The <code>File</code> version will
         try to use <code>SeekableByteChannel</code> and fall back to
         using a <code>FileOutputStream</code> internally if that
         fails.</p>

         <p>If <code>ZipArchiveOutputStream</code> can
           use <code>SeekableByteChannel</code> it can employ some
           optimizations that lead to smaller archives.  It also makes
           it possible to add uncompressed (<code>setMethod</code> used
           with <code>STORED</code>) entries of unknown size when
           calling <code>putArchiveEntry</code> - this is not allowed
           if <code>ZipArchiveOutputStream</code> has to use
           an <code>OutputStream</code>.</p>

         <p>If you know you are writing to a file, you should always
         prefer the <code>File</code>- or
         <code>SeekableByteChannel</code>-arg constructors.  The class
         <code>org.apache.commons.compress.utils.SeekableInMemoryByteChannel</code>
         allows you to write to an in-memory archive.</p>

       </subsection>

       <subsection name="Extra Fields">

         <p>Inside a ZIP archive, additional data can be attached to
           each entry.  The <code>java.util.zip.ZipEntry</code> class
           provides access to this via the <code>get/setExtra</code>
           methods as arrays of <code>byte</code>s.</p>

         <p>The extra data is actually supposed to be more structured
           than that, and Compress' ZIP package provides access to the
           structured data as <code>ZipExtraField</code> instances.  Only
           a subset of all defined extra field formats is supported by
           the package; any other extra field will be stored
           as <code>UnrecognizedExtraField</code>.</p>

         <p>Prior to version 1.1 of this library trying to read an
           archive with extra fields that didn't follow the recommended
           structure for those fields would cause Compress to throw an
           exception.  Starting with version 1.1 these extra fields
           will now be read
           as <code>UnparseableExtraFieldData</code>.</p>
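
         <p>The recommended structure is a sequence of blocks, each
           consisting of a two-byte header ID, a two-byte data length
           (both little-endian) and the data itself.  The following
           minimal sketch splits the raw bytes returned
           by <code>java.util.zip.ZipEntry.getExtra</code> into such
           blocks.  It is not the parser used by Commons Compress; the
           class names are hypothetical and the sketch simply rejects
           data that does not follow the structure.</p>

<source><![CDATA[
```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;
import java.util.ArrayList;
import java.util.List;

final class ExtraFieldSketch {

    /** One extra field block: a 16-bit header ID plus its payload. */
    static final class Block {
        final int headerId;
        final byte[] data;

        Block(int headerId, byte[] data) {
            this.headerId = headerId;
            this.data = data;
        }
    }

    /** Splits raw extra data into blocks, rejecting malformed input. */
    static List<Block> parse(byte[] extra) {
        List<Block> blocks = new ArrayList<>();
        ByteBuffer buf = ByteBuffer.wrap(extra).order(ByteOrder.LITTLE_ENDIAN);
        while (buf.remaining() >= 4) {
            int headerId = buf.getShort() & 0xFFFF;
            int length = buf.getShort() & 0xFFFF;
            if (length > buf.remaining()) {
                throw new IllegalArgumentException("extra field block is truncated");
            }
            byte[] data = new byte[length];
            buf.get(data);
            blocks.add(new Block(headerId, data));
        }
        if (buf.hasRemaining()) {
            throw new IllegalArgumentException("trailing bytes after last block");
        }
        return blocks;
    }
}
```
]]></source>

         <p>Data that makes this sketch throw corresponds to the case
           described above, where the extra fields do not follow the
           recommended structure.</p>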

       </subsection>

       <subsection name="Encoding" id="encoding">

         <p>Traditionally the ZIP archive format uses CodePage 437 as
           the encoding for file names, which is not sufficient for many
           international character sets.</p>

         <p>Over time different archivers have chosen different ways to
           work around the limitation.  The <code>java.util.zip</code>
           package, for example, simply uses UTF-8 as its encoding.</p>

         <p>Ant has been offering the encoding attribute of the zip and
           unzip task as a way to explicitly specify the encoding to
           use (or expect) since Ant 1.4.  It defaults to the
           platform's default encoding for zip and UTF-8 for jar and
           other jar-like tasks (war, ear, ...) as well as the unzip
           family of tasks.</p>

         <p>More recent versions of the ZIP specification introduce
           something called the &quot;language encoding flag&quot;
           which can be used to signal that a file name has been
           encoded using UTF-8.  All ZIP archives written by Compress
           will set this flag if the encoding has been set to UTF-8.
           Our interoperability tests with existing archivers didn't
           show any ill effects (in fact, most archivers ignore the
           flag to date), but if you encounter problems you can turn
           off the &quot;language encoding flag&quot; by setting the attribute
           <code>useLanguageEncodingFlag</code> to <code>false</code> on the
           <code>ZipArchiveOutputStream</code>.</p>

         <p>The <code>ZipFile</code>
           and <code>ZipArchiveInputStream</code> classes will
           recognize the language encoding flag and ignore the encoding
           set in the constructor if it has been found.</p>

         <p>The InfoZIP developers have introduced new ZIP extra fields
           that can be used to add an additional UTF-8 encoded file
           name to the entry's metadata.  Most archivers ignore these
           extra fields.  <code>ZipArchiveOutputStream</code> supports
           an option <code>createUnicodeExtraFields</code> which makes
           it write these extra fields either for all entries
           ("always") or only for those whose name cannot be encoded
           using the specified encoding ("not-encodeable").  It
           defaults to "never" since the extra fields create bigger
           archives.</p>

         <p>The fallbackToUTF8 attribute
           of <code>ZipArchiveOutputStream</code> can be used to create
           archives that use the specified encoding in the majority of
           cases but UTF-8 and the language encoding flag for filenames
           that cannot be encoded using the specified encoding.</p>
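
         <p>The per-name decision behind such a fallback can be
           sketched with the JDK's charset API.  This is an
           illustration under stated assumptions rather than the
           library's implementation: the class and method names are
           hypothetical, and the only format detail relied on is that
           the language encoding flag is bit 11 of an entry's general
           purpose bit flag.</p>

<source><![CDATA[
```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

final class NameEncodingSketch {

    /** Bit 11 of the general purpose bit flag: the language encoding flag. */
    static final int LANGUAGE_ENCODING_FLAG = 1 << 11;

    /** Uses the preferred charset if it can represent the name, UTF-8 otherwise. */
    static Charset charsetFor(String name, Charset preferred) {
        return preferred.newEncoder().canEncode(name)
                ? preferred
                : StandardCharsets.UTF_8;
    }

    /** Sets the language encoding flag only for names written as UTF-8. */
    static int generalPurposeFlags(Charset used) {
        return StandardCharsets.UTF_8.equals(used) ? LANGUAGE_ENCODING_FLAG : 0;
    }
}
```
]]></source>

         <p>With a preferred charset such as
           <code>Charset.forName("IBM437")</code> (where the runtime
           provides it), plain ASCII names keep the preferred encoding
           while names containing, for example, CJK characters are
           written as UTF-8 with the flag set.</p>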

         <p>The <code>ZipFile</code>
           and <code>ZipArchiveInputStream</code> classes recognize the
           Unicode extra fields by default and read the file name
           information from them, unless you set the constructor parameter
           <code>scanForUnicodeExtraFields</code> to false.</p>

         <h4>Recommendations for Interoperability</h4>

         <p>The optimal setting of flags depends on the archivers you
           expect as consumers/producers of the ZIP archives.  Below
           are some test results which may be superseded with later
           versions of each tool.</p>

         <ul>
           <li>Before Java 7, the java.util.zip package used by the jar
             executable or to read jars from your CLASSPATH reads and
             writes UTF-8 names.  It doesn't set or recognize any flags
             or Unicode extra fields.</li>

           <li>Starting with Java7 <code>java.util.zip</code> writes
             UTF-8 by default and uses the language encoding flag.  It
             is possible to specify a different encoding when
             reading/writing ZIPs via new constructors.  The package
             now recognizes the language encoding flag when reading and
             ignores the Unicode extra fields.</li>

           <li>7Zip writes CodePage 437 by default but uses UTF-8 and
             the language encoding flag when writing entries that
             cannot be encoded as CodePage 437 (similar to the zip task
             with fallbacktoUTF8 set to true).  It recognizes the
             language encoding flag when reading and ignores the
             Unicode extra fields.</li>

           <li>WinZIP writes CodePage 437 and uses Unicode extra fields
             by default.  It recognizes the Unicode extra field and the
             language encoding flag when reading.</li>

           <li>Windows' "compressed folder" feature doesn't recognize
             any flag or extra field and creates archives using the
             platform's default encoding - and expects archives to be in
             that encoding when reading them.</li>

           <li>InfoZIP based tools can recognize and write both.  This
             is a compile time option and depends on the platform, so
             your mileage may vary.</li>

           <li>PKWARE zip tools recognize both and prefer the language
             encoding flag.  They create archives using CodePage 437 if
             possible and UTF-8 plus the language encoding flag for
             file names that cannot be encoded as CodePage 437.</li>
         </ul>

         <p>So, what to do?</p>

         <p>If you are creating jars, then java.util.zip is your main
           consumer.  We recommend you set the encoding to UTF-8 and
           keep the language encoding flag enabled.  The flag won't
           help or hurt java.util.zip prior to Java7 but archivers that
           support it will show the correct file names.</p>

         <p>For maximum interop it is probably best to set the encoding
           to UTF-8, enable the language encoding flag and create
           Unicode extra fields when writing ZIPs.  Such archives
           should be extracted correctly by java.util.zip, 7Zip,
           WinZIP, PKWARE tools and most likely InfoZIP tools.  They
           will be unusable with Windows' "compressed folders" feature
           and bigger than archives without the Unicode extra fields,
           though.</p>

         <p>If Windows' "compressed folders" is your primary consumer,
           then your best option is to explicitly set the encoding to
           the target platform.  You may want to enable creation of
           Unicode extra fields so the tools that support them will
           extract the file names correctly.</p>
       </subsection>

       <subsection name="Encryption and Alternative Compression Algorithms"
                   id="encryption">

         <p>In most cases entries of an archive are not encrypted and
         are either not compressed at all or use the DEFLATE
         algorithm; Commons Compress' ZIP archiver will handle them
         just fine.  As of version 1.7, Commons Compress can also
         decompress entries compressed with the legacy SHRINK and
         IMPLODE algorithms of PKZIP 1.x.  Version 1.11 of Commons
         Compress adds read-only support for BZIP2.  Version 1.16 adds
         read-only support for DEFLATE64 - also known as "enhanced DEFLATE".</p>

         <p>The ZIP specification allows for various other compression
         algorithms and also supports several different ways of
         encrypting archive contents.  None of these is
         currently supported by Commons Compress, and no such entry
         can be extracted by the archiving code.</p>

         <p><code>ZipFile</code>'s and
         <code>ZipArchiveInputStream</code>'s
         <code>canReadEntryData</code> methods will return false for
         encrypted entries or entries using an unsupported compression
         method.  Using these methods it is possible to at least
         detect and skip the entries that cannot be extracted.</p>

         <table>
           <thead>
             <tr>
               <th>Version of Apache Commons Compress</th>
               <th>Supported Compression Methods</th>
               <th>Supported Encryption Methods</th>
             </tr>
           </thead>
           <tbody>
             <tr>
               <td>1.0 to 1.6</td>
               <td>STORED, DEFLATE</td>
               <td>-</td>
             </tr>
             <tr>
               <td>1.7 to 1.10</td>
               <td>STORED, DEFLATE, SHRINK, IMPLODE</td>
               <td>-</td>
             </tr>
             <tr>
               <td>1.11 to 1.15</td>
               <td>STORED, DEFLATE, SHRINK, IMPLODE, BZIP2</td>
               <td>-</td>
             </tr>
             <tr>
               <td>1.16 and later</td>
               <td>STORED, DEFLATE, SHRINK, IMPLODE, BZIP2, DEFLATE64
               (enhanced deflate)</td>
               <td>-</td>
             </tr>
           </tbody>
         </table>

       </subsection>

       <subsection name="Zip64 Support" id="zip64">
         <p>The traditional ZIP format is limited to archive sizes of
           four gibibytes (actually 2<sup>32</sup> - 1 bytes &#x2248;
           4.3 GB) and 65535 entries, where each individual entry is
           limited to four gibibytes as well.  These limits seemed
           excessive in the 1980s.</p>

         <p>Version 4.5 of the ZIP specification introduced the
           so-called "Zip64 extensions" to raise those limits: entries
           may have compressed or uncompressed sizes of up to 16
           exbibytes (actually 2<sup>64</sup> - 1 bytes &#x2248; 18.5 EB,
           i.e. 18.5 x 10<sup>18</sup> bytes), in archives that
           themselves can take up to 16 exbibytes and contain more than
           18 x 10<sup>18</sup> entries.</p>
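
         <p>The traditional limits quoted above follow directly from
           the 32-bit size fields and the 16-bit entry counter of the
           format.  The following minimal sketch spells them out; the
           class and method names are hypothetical, and it treats the
           boundary values themselves as still fitting, whereas a real
           writer may be stricter about them.</p>

<source><![CDATA[
```java
/** Sketch only: the limits of the ZIP format without Zip64 extensions. */
final class ZipLimitsSketch {

    /** Largest value of a 32-bit size field: 2^32 - 1 = 4294967295. */
    static final long MAX_32_BIT = 0xFFFFFFFFL;

    /** Largest value of the 16-bit entry counter: 2^16 - 1 = 65535. */
    static final int MAX_ENTRIES_16_BIT = 0xFFFF;

    /** True if the given sizes and entry count fit the traditional fields. */
    static boolean fitsTraditionalFormat(long uncompressedSize,
                                         long compressedSize,
                                         long entryCount) {
        return uncompressedSize <= MAX_32_BIT
                && compressedSize <= MAX_32_BIT
                && entryCount <= MAX_ENTRIES_16_BIT;
    }
}
```
]]></source>

         <p>Anything for which this check fails can only be
           represented with the Zip64 extensions.</p>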

         <p>Apache Commons Compress 1.2 and earlier do not support
           Zip64 extensions at all.</p>

         <p>Starting with Apache Commons Compress
           1.3 <code>ZipArchiveInputStream</code>
           and <code>ZipFile</code> transparently support Zip64
           extensions.  By default <code>ZipArchiveOutputStream</code>
           supports them transparently as well (i.e. it adds Zip64
           extensions if needed and doesn't use them for
           entries/archives that don't need them) if the compressed and
           uncompressed sizes of the entry are known
           when <code>putArchiveEntry</code> is called
           or <code>ZipArchiveOutputStream</code>
           uses <code>SeekableByteChannel</code>
           (see <a href="#ZipArchiveOutputStream">above</a>).  If only
           the uncompressed size is
           known <code>ZipArchiveOutputStream</code> will assume the
           compressed size will not be bigger than the uncompressed
           size.</p>

         <p><code>ZipArchiveOutputStream</code>'s
           <code>setUseZip64</code> can be used to control the behavior.
           <code>Zip64Mode.AsNeeded</code> is the default behavior
           described in the previous paragraph.</p>

         <p>If <code>ZipArchiveOutputStream</code> is writing to a
           non-seekable stream it has to decide whether to use Zip64
           extensions or not before it starts writing the entry data.
           This means that if the size of the entry is unknown
           when <code>putArchiveEntry</code> is called it doesn't have
           anything to base the decision on.  By default it will not
           use Zip64 extensions in order to create archives that can be
           extracted by older archivers (it will later throw an
           exception in <code>closeEntry</code> if it detects Zip64
           extensions would have been needed).  It is possible to
           instruct <code>ZipArchiveOutputStream</code> to always
           create Zip64 extensions by calling
           <code>setUseZip64</code> with an argument
           of <code>Zip64Mode.Always</code>; use this if you are
           writing entries of unknown size to a stream and expect some
           of them to be too big to fit into the traditional
           limits.</p>

         <p><code>Zip64Mode.Always</code> creates archives that use
           Zip64 extensions for all entries, even those that don't
           require them.  Such archives will be slightly bigger than
           archives created with one of the other modes and not be
           readable by unarchivers that don't support Zip64
           extensions.</p>

         <p><code>Zip64Mode.Never</code> will not use any Zip64
           extensions at all and may cause
           a <code>Zip64RequiredException</code> to be thrown
           if <code>ZipArchiveOutputStream</code> detects that one of
           the format's limits is exceeded.  Archives created in this
           mode will be readable by all unarchivers; they may be
           slightly smaller than archives created
           with <code>SeekableByteChannel</code>
           in <code>Zip64Mode.AsNeeded</code> mode if some of the
           entries had unknown sizes.</p>

         <p>The <code>java.util.zip</code> package and the
           <code>jar</code> command of Java 5 and earlier cannot read
           Zip64 extensions and will fail if the archive contains any.
           So if you intend to create archives that Java 5 can consume
           you must set the mode to <code>Zip64Mode.Never</code>.</p>

         <h4>Known Limitations</h4>

         <p>Some of the theoretical limits of the format are not
           reached because of Apache Commons Compress' own API
           (<code>ArchiveEntry</code>'s size information uses
           a <code>long</code>) or its internal usage of Java collections
           or <code>SeekableByteChannel</code>.  The table
           below shows the theoretical limits supported by Apache
           Commons Compress.  In practice it is very likely that you
           would run out of memory, or that your file system would not
           allow files that big, long before you reach either limit.</p>

         <table>
           <thead>
             <tr>
               <th/>
               <th>Max. Size of Archive</th>
               <th>Max. Compressed/Uncompressed Size of Entry</th>
               <th>Max. Number of Entries</th>
             </tr>
           </thead>
           <tbody>
             <tr>
               <td>ZIP Format Without Zip 64 Extensions</td>
               <td>2<sup>32</sup> - 1 bytes &#x2248; 4.3 GB</td>
               <td>2<sup>32</sup> - 1 bytes &#x2248; 4.3 GB</td>
               <td>65535</td>
             </tr>
             <tr>
               <td>ZIP Format using Zip 64 Extensions</td>
               <td>2<sup>64</sup> - 1 bytes &#x2248; 18.5 EB</td>
               <td>2<sup>64</sup> - 1 bytes &#x2248; 18.5 EB</td>
               <td>2<sup>64</sup> - 1 &#x2248; 18.5 x 10<sup>18</sup></td>
             </tr>
             <tr>
               <td>Commons Compress 1.2 and earlier</td>
               <td>unlimited in <code>ZipArchiveInputStream</code>
                 and <code>ZipArchiveOutputStream</code> and
                 2<sup>32</sup> - 1 bytes &#x2248; 4.3 GB
                 in <code>ZipFile</code>.</td>
               <td>2<sup>32</sup> - 1 bytes &#x2248; 4.3 GB</td>
               <td>unlimited in <code>ZipArchiveInputStream</code>,
                 65535 in <code>ZipArchiveOutputStream</code>
                 and <code>ZipFile</code>.</td>
             </tr>
             <tr>
               <td>Commons Compress 1.3 and later</td>
               <td>unlimited in <code>ZipArchiveInputStream</code>
                 and <code>ZipArchiveOutputStream</code> and
                 2<sup>63</sup> - 1 bytes &#x2248; 9.2 EB
                 in <code>ZipFile</code>.</td>
               <td>2<sup>63</sup> - 1 bytes &#x2248; 9.2 EB</td>
               <td>unlimited in <code>ZipArchiveInputStream</code>,
                 2<sup>31</sup> - 1 &#x2248; 2.1 billion
                 in <code>ZipArchiveOutputStream</code>
                 and <code>ZipFile</code>.</td>
             </tr>
           </tbody>
         </table>

         <h4>Known Interoperability Problems</h4>

         <p>The <code>java.util.zip</code> package of OpenJDK7 supports
         Zip 64 extensions but its <code>ZipInputStream</code> and
         <code>ZipFile</code> classes will be unable to extract
         archives created with Commons Compress 1.3's
         <code>ZipArchiveOutputStream</code> if the archive contains
         entries that use the data descriptor, are smaller than 4 GiB
         and have Zip 64 extensions enabled.  I.e. the classes in
         OpenJDK currently support archives that use Zip 64
         extensions only when they are actually needed.  These classes
         are used to load JAR files and are the base for the
         <code>jar</code> command line utility as well.</p>
       </subsection>

       <subsection name="Consuming Archives Completely">

         <p>Prior to version 1.5 <code>ZipArchiveInputStream</code>
         would return null from <code>getNextEntry</code> or
         <code>getNextZipEntry</code> as soon as the first central
         directory header of the archive was found, leaving the whole
         central directory itself unread inside the stream.  Starting
         with version 1.5 <code>ZipArchiveInputStream</code> will try
         to read the archive up to and including the "end of central
         directory" record, effectively consuming the archive
         completely.</p>

       </subsection>

       <subsection name="Symbolic Links" id="symlinks">

         <p>Starting with Compress 1.5 <code>ZipArchiveEntry</code>
         recognizes Unix Symbolic Link entries written by InfoZIP's
         zip.</p>

         <p>The <code>ZipFile</code> class contains a convenience
         method to read the link name of an entry.  Basically all it
         does is read the contents of the entry and convert it to
         a string using the given file name encoding of the
         archive.</p>
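
         <p>What this amounts to can be sketched without the library.
           The following is a minimal illustration, not
           the <code>ZipFile</code> implementation: the class and
           method names are hypothetical, and it assumes the entry's
           Unix mode and an <code>InputStream</code> over its content
           have already been obtained.  A symbolic link is recognized
           by the file type bits of the Unix mode, and its target is
           the entry's content decoded with the file name
           encoding.</p>

<source><![CDATA[
```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.nio.charset.Charset;

final class SymlinkSketch {

    /** Mask selecting the file type bits of a Unix mode. */
    static final int FILE_TYPE_MASK = 0170000;

    /** File type value that marks a symbolic link. */
    static final int LINK_FLAG = 0120000;

    static boolean isUnixSymlink(int unixMode) {
        return (unixMode & FILE_TYPE_MASK) == LINK_FLAG;
    }

    /** Reads the whole entry content and decodes it as the link target. */
    static String linkTarget(InputStream entryContent, Charset nameEncoding) {
        try {
            ByteArrayOutputStream bytes = new ByteArrayOutputStream();
            byte[] buffer = new byte[1024];
            int n;
            while ((n = entryContent.read(buffer)) != -1) {
                bytes.write(buffer, 0, n);
            }
            return new String(bytes.toByteArray(), nameEncoding);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```
]]></source>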

       </subsection>

       <subsection name="Parallel zip creation" id="parallel">

         <p>Starting with Compress 1.10 there is built-in support for
           parallel creation of zip archives.</p>

           <p>Multiple threads can write
           to their own <code>ScatterZipOutputStream</code>
           instance that is backed by a file or by some user-implemented form of
           storage (implementing <code>ScatterGatherBackingStore</code>).</p>

           <p>When the threads finish, they can join these streams together
           to a complete zip file using the <code>writeTo</code> method
           that will write a single <code>ScatterZipOutputStream</code> to a target
           <code>ZipArchiveOutputStream</code>.</p>

           <p>To assist this process, clients can use
           <code>ParallelScatterZipCreator</code>, which handles thread
           pools and correct memory model consistency so the client
           can avoid these issues. Please note that when writing well-formed
           Zip files this way, it is usually necessary to keep a
           separate <code>ScatterZipOutputStream</code> that receives all directories
           and to write it to the target <code>ZipArchiveOutputStream</code> before
           the ones created through <code>ParallelScatterZipCreator</code>. This is the responsibility of the client.</p>

           <p>There is no guarantee of order of the entries when writing a Zip
           file with <code>ParallelScatterZipCreator</code>.</p>

           <p>See the examples section for a code sample demonstrating how to make a zip file.</p>
       </subsection>

     </section>
   </body>
 </document>