<?xml version="1.0"?>
<!--
Licensed to the Apache Software Foundation (ASF) under one or more
contributor license agreements. See the NOTICE file distributed with
this work for additional information regarding copyright ownership.
The ASF licenses this file to You under the Apache License, Version 2.0
(the "License"); you may not use this file except in compliance with
the License. You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License.
-->
<document>
<properties>
<title>Commons Compress ZIP package</title>
<author email="dev@commons.apache.org">Commons Documentation Team</author>
</properties>
<body>
<section name="The ZIP package">
<p>The ZIP package provides features not found
in <code>java.util.zip</code>:</p>
<ul>
<li>Support for encodings other than UTF-8 for filenames and
comments. Starting with Java7 this is supported
by <code>java.util.zip</code> as well.</li>
<li>Access to internal and external attributes (which are used
to store Unix permission by some zip implementations).</li>
<li>Structured support for extra fields.</li>
</ul>
<p>In addition to the information stored
in <code>ArchiveEntry</code> a <code>ZipArchiveEntry</code>
stores internal and external attributes as well as extra
fields which may contain information like Unix permissions,
information about the platform they've been created on, their
last modification time and an optional comment.</p>
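<p>As a rough illustration of what the external attributes can
carry: for entries created on Unix, archivers following the
InfoZIP convention place the file mode in the upper 16 bits of
the 32-bit external attributes field. The sketch below decodes
such a value with plain Java. The class and method names are
hypothetical and are not part of Commons Compress.</p>
<source><![CDATA[
```java
class ExternalAttributesSketch {
    /** Extracts the Unix mode from the upper 16 bits of the external attributes. */
    static int unixMode(long externalAttributes) {
        return (int) ((externalAttributes >> 16) & 0xFFFF);
    }

    /** Keeps only the nine rwxrwxrwx permission bits of a Unix mode. */
    static int permissionBits(int unixMode) {
        return unixMode & 0777;
    }

    public static void main(String[] args) {
        // Hypothetical value: a regular file (type bits 0100000) with permissions 0644.
        long externalAttributes = 0100644L << 16;
        int mode = unixMode(externalAttributes);
        System.out.println(Integer.toOctalString(mode));                 // prints 100644
        System.out.println(Integer.toOctalString(permissionBits(mode))); // prints 644
    }
}
```
]]></source>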
<subsection name="ZipArchiveInputStream vs ZipFile">
<p>ZIP archives store archive entries in sequence and
contain a registry of all entries at the very end of the
archive. It is acceptable for an archive to contain several
entries of the same name and have the registry (called the
central directory) decide which entry is actually to be used
(if any).</p>
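<p>To make the layout more tangible, the following sketch
builds a small archive in memory with <code>java.util.zip</code>
and then locates the "end of central directory" record by
scanning backwards from the end of the archive. It is a
simplified illustration that ignores Zip64 and archive comments
containing the signature. All class and method names are
hypothetical.</p>
<source><![CDATA[
```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.ByteBuffer;
import java.nio.ByteOrder;
import java.nio.charset.StandardCharsets;
import java.util.zip.ZipEntry;
import java.util.zip.ZipOutputStream;

class EndOfCentralDirectorySketch {
    private static final int EOCD_SIGNATURE = 0x06054b50;
    private static final int EOCD_MIN_SIZE = 22; // record size without an archive comment

    /** Scans backwards for the "end of central directory" record and returns its offset. */
    static int findEocd(byte[] archive) {
        ByteBuffer buf = ByteBuffer.wrap(archive).order(ByteOrder.LITTLE_ENDIAN);
        for (int pos = archive.length - EOCD_MIN_SIZE; pos >= 0; pos--) {
            if (buf.getInt(pos) == EOCD_SIGNATURE) {
                return pos;
            }
        }
        throw new IllegalArgumentException("no end of central directory record found");
    }

    /** Total number of entries recorded in the central directory (no Zip64 handling). */
    static int entryCount(byte[] archive) {
        ByteBuffer buf = ByteBuffer.wrap(archive).order(ByteOrder.LITTLE_ENDIAN);
        return buf.getShort(findEocd(archive) + 10) & 0xFFFF;
    }

    /** Offset of the first central directory header (no Zip64 handling). */
    static long centralDirectoryOffset(byte[] archive) {
        ByteBuffer buf = ByteBuffer.wrap(archive).order(ByteOrder.LITTLE_ENDIAN);
        return buf.getInt(findEocd(archive) + 16) & 0xFFFFFFFFL;
    }

    /** Builds a small two-entry archive in memory with java.util.zip. */
    static byte[] sampleArchive() {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (ZipOutputStream zip = new ZipOutputStream(bytes)) {
            for (String name : new String[] {"a.txt", "b.txt"}) {
                zip.putNextEntry(new ZipEntry(name));
                zip.write(name.getBytes(StandardCharsets.UTF_8));
                zip.closeEntry();
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return bytes.toByteArray();
    }

    public static void main(String[] args) {
        byte[] archive = sampleArchive();
        System.out.println("entries: " + entryCount(archive));
        System.out.println("central directory starts at byte "
                + centralDirectoryOffset(archive) + " of " + archive.length);
    }
}
```
]]></source>
<p>A stream-based reader only learns these values after it has
already passed every entry, which is why it cannot consult the
central directory while iterating.</p>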
<p>In addition, the ZIP format stores certain information only
inside the central directory and not together with the entry
itself:</p>
<ul>
<li>internal and external attributes</li>
<li>different or additional extra fields</li>
</ul>
<p>This means the ZIP format cannot really be parsed
correctly while reading a non-seekable stream, which is what
<code>ZipArchiveInputStream</code> is forced to do. As a
result <code>ZipArchiveInputStream</code></p>
<ul>
<li>may return entries that are not part of the central
directory at all and shouldn't be considered part of the
archive.</li>
<li>may return several entries with the same name.</li>
<li>will not return internal or external attributes.</li>
<li>may return incomplete extra field data.</li>
<li>may return unknown sizes and CRC values for entries
until the next entry has been reached if the archive uses
the data descriptor feature (see below).</li>
</ul>
<p><code>ZipArchiveInputStream</code> shares these limitations
with <code>java.util.zip.ZipInputStream</code>.</p>
<p><code>ZipFile</code> is able to read the central directory
first and provide correct and complete information on any
ZIP archive.</p>
<p>ZIP archives know a feature called the data descriptor
which is a way to store an entry's length after the entry's
data. This can only work reliably if the size information
can be taken from the central directory or the data itself
can signal it is complete, which is true for data that is
compressed using the DEFLATED compression algorithm.</p>
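<p>The effect of the data descriptor on stream-based reading
can be observed with <code>java.util.zip.ZipInputStream</code>,
which is subject to the same constraint. The sketch below writes
a DEFLATED entry without announcing its size and then asks for
the size before and after the entry's data has been consumed.
The class and method names are hypothetical.</p>
<source><![CDATA[
```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.zip.ZipEntry;
import java.util.zip.ZipInputStream;
import java.util.zip.ZipOutputStream;

class DataDescriptorSketch {
    /** Writes one DEFLATED entry of unannounced size, so a data descriptor is used. */
    static byte[] archiveWithDataDescriptor(byte[] payload) {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (ZipOutputStream zip = new ZipOutputStream(bytes)) {
            zip.putNextEntry(new ZipEntry("hello.txt"));
            zip.write(payload);
            zip.closeEntry();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return bytes.toByteArray();
    }

    /** Size reported right after the entry's local header has been read. */
    static long sizeBeforeReading(byte[] archive) {
        try (ZipInputStream in = new ZipInputStream(new ByteArrayInputStream(archive))) {
            return in.getNextEntry().getSize();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    /** Size reported once the entry's data, and with it the descriptor, has been consumed. */
    static long sizeAfterReading(byte[] archive) {
        try (ZipInputStream in = new ZipInputStream(new ByteArrayInputStream(archive))) {
            ZipEntry entry = in.getNextEntry();
            byte[] buffer = new byte[1024];
            while (in.read(buffer) != -1) {
                // drain the entry; the data descriptor follows the compressed data
            }
            return entry.getSize();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        byte[] archive = archiveWithDataDescriptor("hello".getBytes(StandardCharsets.UTF_8));
        // A value of -1 means "unknown".
        System.out.println("size after local header: " + sizeBeforeReading(archive));
        System.out.println("size after entry data:   " + sizeAfterReading(archive));
    }
}
```
]]></source>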
<p><code>ZipFile</code> has access to the central directory
and can extract entries using the data descriptor reliably.
The same is true for <code>ZipArchiveInputStream</code> as
long as the entry is DEFLATED. For STORED
entries <code>ZipArchiveInputStream</code> can try to read
ahead until it finds the next entry, but this approach is
not safe and has to be enabled by a constructor argument
explicitly.</p>
<p>If possible, you should always prefer <code>ZipFile</code>
over <code>ZipArchiveInputStream</code>.</p>
<p><code>ZipFile</code> requires a
<code>SeekableByteChannel</code> that will be obtained
transparently when reading from a file. The class
<code>org.apache.commons.compress.utils.SeekableInMemoryByteChannel</code>
allows you to read from an in-memory archive.</p>
</subsection>
<subsection name="ZipArchiveOutputStream" id="ZipArchiveOutputStream">
<p><code>ZipArchiveOutputStream</code> has three constructors:
one takes a <code>File</code> argument, one a
<code>SeekableByteChannel</code> and the last an
<code>OutputStream</code>. The <code>File</code> version will
try to use <code>SeekableByteChannel</code> and fall back to
using a <code>FileOutputStream</code> internally if that
fails.</p>
<p>If <code>ZipArchiveOutputStream</code> can
use <code>SeekableByteChannel</code> it can employ some
optimizations that lead to smaller archives. It also makes
it possible to add uncompressed (<code>setMethod</code> used
with <code>STORED</code>) entries of unknown size when
calling <code>putArchiveEntry</code> - this is not allowed
if <code>ZipArchiveOutputStream</code> has to use
an <code>OutputStream</code>.</p>
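<p>When only a plain <code>OutputStream</code> is available,
one way around this restriction is to determine size and CRC
before the entry is added. The sketch below shows the idea with
<code>java.util.zip.ZipOutputStream</code>, which imposes the
same requirement on STORED entries. It is an illustration under
that assumption, not Commons Compress code, and the class and
method names are hypothetical.</p>
<source><![CDATA[
```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.zip.CRC32;
import java.util.zip.ZipEntry;
import java.util.zip.ZipOutputStream;

class StoredEntrySketch {
    /** Creates a STORED entry whose size and CRC are known before any data is written. */
    static ZipEntry storedEntry(String name, byte[] data) {
        CRC32 crc = new CRC32();
        crc.update(data);
        ZipEntry entry = new ZipEntry(name);
        entry.setMethod(ZipEntry.STORED);
        entry.setSize(data.length);
        entry.setCompressedSize(data.length); // STORED: compressed size equals size
        entry.setCrc(crc.getValue());
        return entry;
    }

    /** Writes a single uncompressed entry to a non-seekable in-memory stream. */
    static byte[] writeStored(String name, byte[] data) {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (ZipOutputStream zip = new ZipOutputStream(bytes)) {
            zip.putNextEntry(storedEntry(name, data));
            zip.write(data);
            zip.closeEntry();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return bytes.toByteArray();
    }

    public static void main(String[] args) {
        byte[] data = "hello".getBytes(StandardCharsets.UTF_8);
        System.out.println("archive size: " + writeStored("raw.bin", data).length + " bytes");
    }
}
```
]]></source>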
<p>If you know you are writing to a file, you should always
prefer the <code>File</code>- or
<code>SeekableByteChannel</code>-arg constructors. The class
<code>org.apache.commons.compress.utils.SeekableInMemoryByteChannel</code>
allows you to write to an in-memory archive.</p>
</subsection>
<subsection name="Extra Fields">
<p>Inside a ZIP archive, additional data can be attached to
each entry. The <code>java.util.zip.ZipEntry</code> class
provides access to this via the <code>get/setExtra</code>
methods as arrays of <code>byte</code>s.</p>
<p>The extra data is actually supposed to be more structured
than that, and Compress' ZIP package provides access to the
structured data as <code>ExtraField</code> instances. Only
a subset of all defined extra field formats is supported by
the package; any other extra field will be stored
as <code>UnrecognizedExtraField</code>.</p>
<p>Prior to version 1.1 of this library trying to read an
archive with extra fields that didn't follow the recommended
structure for those fields would cause Compress to throw an
exception. Starting with version 1.1 these extra fields
will now be read
as <code>UnparseableExtraFieldData</code>.</p>
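<p>The recommended structure is a sequence of blocks, each
consisting of a two-byte header ID, a two-byte length (both
little-endian) and the payload. The following sketch splits a
raw <code>byte</code> array along those lines and reports data
that does not fit the structure. It only illustrates the format
and is not how Compress implements its parsing. The class and
method names are hypothetical.</p>
<source><![CDATA[
```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;
import java.util.LinkedHashMap;
import java.util.Map;

class ExtraFieldSketch {
    /** Splits raw extra data into header ID / payload pairs, rejecting malformed input. */
    static Map<Integer, byte[]> parse(byte[] extra) {
        Map<Integer, byte[]> fields = new LinkedHashMap<>();
        ByteBuffer buf = ByteBuffer.wrap(extra).order(ByteOrder.LITTLE_ENDIAN);
        while (buf.hasRemaining()) {
            if (buf.remaining() < 4) {
                throw new IllegalArgumentException("truncated extra field header");
            }
            int headerId = buf.getShort() & 0xFFFF;
            int length = buf.getShort() & 0xFFFF;
            if (length > buf.remaining()) {
                throw new IllegalArgumentException("extra field 0x"
                        + Integer.toHexString(headerId) + " claims " + length + " bytes");
            }
            byte[] payload = new byte[length];
            buf.get(payload);
            fields.put(headerId, payload);
        }
        return fields;
    }

    /** True if the data follows the header ID / length / payload structure. */
    static boolean isWellFormed(byte[] extra) {
        try {
            parse(extra);
            return true;
        } catch (IllegalArgumentException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        // 0x5455 is InfoZIP's extended timestamp field: one flag byte plus a 4-byte time.
        byte[] extra = {0x55, 0x54, 0x05, 0x00, 0x01, 0x00, 0x00, 0x00, 0x00};
        parse(extra).forEach((id, data) ->
                System.out.println("0x" + Integer.toHexString(id) + ": " + data.length + " bytes"));
        // The declared length (16) exceeds the remaining data, so this prints false.
        System.out.println(isWellFormed(new byte[] {0x55, 0x54, 0x10, 0x00, 0x01}));
    }
}
```
]]></source>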
</subsection>
<subsection name="Encoding" id="encoding">
<p>Traditionally the ZIP archive format uses CodePage 437 as
encoding for file names, which is not sufficient for many
international character sets.</p>
<p>Over time different archivers have chosen different ways to
work around the limitation - the <code>java.util.zip</code>
package simply uses UTF-8 as its encoding, for example.</p>
<p>Ant has been offering the encoding attribute of the zip and
unzip task as a way to explicitly specify the encoding to
use (or expect) since Ant 1.4. It defaults to the
platform's default encoding for zip and UTF-8 for jar and
other jar-like tasks (war, ear, ...) as well as the unzip
family of tasks.</p>
<p>More recent versions of the ZIP specification introduce
something called the &quot;language encoding flag&quot;
which can be used to signal that a file name has been
encoded using UTF-8. All ZIP-archives written by Compress
will set this flag, if the encoding has been set to UTF-8.
Our interoperability tests with existing archivers didn't
show any ill effects (in fact, most archivers ignore the
flag to date), but you can turn off the "language encoding
flag" by setting the attribute
<code>useLanguageEncodingFlag</code> to <code>false</code> on the
<code>ZipArchiveOutputStream</code> if you should encounter
problems.</p>
<p>The <code>ZipFile</code>
and <code>ZipArchiveInputStream</code> classes will
recognize the language encoding flag and ignore the encoding
set in the constructor if it has been found.</p>
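<p>In the ZIP format the language encoding flag is bit 11 of
the general purpose bit flag that is part of every entry header.
As a sketch of what recognizing the flag amounts to, the code
below reads that field from a local file header and tests the
bit. It uses plain Java on a byte array, and its class and
method names are hypothetical.</p>
<source><![CDATA[
```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.ByteBuffer;
import java.nio.ByteOrder;
import java.util.zip.ZipEntry;
import java.util.zip.ZipOutputStream;

class GeneralPurposeFlagSketch {
    private static final int LOCAL_HEADER_SIGNATURE = 0x04034b50;
    static final int DATA_DESCRIPTOR_FLAG = 1 << 3;
    static final int LANGUAGE_ENCODING_FLAG = 1 << 11;

    /** Reads the general purpose bit flag of the local file header at the given offset. */
    static int flagsAt(byte[] archive, int localHeaderOffset) {
        ByteBuffer buf = ByteBuffer.wrap(archive).order(ByteOrder.LITTLE_ENDIAN);
        if (buf.getInt(localHeaderOffset) != LOCAL_HEADER_SIGNATURE) {
            throw new IllegalArgumentException("not a local file header");
        }
        return buf.getShort(localHeaderOffset + 6) & 0xFFFF;
    }

    /** True if the flags declare file name and comment to be UTF-8. */
    static boolean namesAreUtf8(int flags) {
        return (flags & LANGUAGE_ENCODING_FLAG) != 0;
    }

    /** True if sizes and CRC follow the entry data in a data descriptor. */
    static boolean hasDataDescriptor(int flags) {
        return (flags & DATA_DESCRIPTOR_FLAG) != 0;
    }

    public static void main(String[] args) {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (ZipOutputStream zip = new ZipOutputStream(bytes)) {
            zip.putNextEntry(new ZipEntry("hello.txt"));
            zip.write(new byte[] {1, 2, 3});
            zip.closeEntry();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        // The first local file header starts at offset 0 of the archive.
        int flags = flagsAt(bytes.toByteArray(), 0);
        System.out.println("flags: 0x" + Integer.toHexString(flags));
        System.out.println("UTF-8 names: " + namesAreUtf8(flags));
        System.out.println("data descriptor: " + hasDataDescriptor(flags));
    }
}
```
]]></source>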
<p>The InfoZIP developers have introduced new ZIP extra fields
that can be used to add an additional UTF-8 encoded file
name to the entry's metadata. Most archivers ignore these
extra fields. <code>ZipArchiveOutputStream</code> supports
an option <code>createUnicodeExtraFields</code> which makes
it write these extra fields either for all entries
("always") or only those whose name cannot be encoded using
the specified encoding ("not-encodeable"). It defaults to
"never" since the extra fields create bigger archives.</p>
<p>The <code>fallbackToUTF8</code> attribute
of <code>ZipArchiveOutputStream</code> can be used to create
archives that use the specified encoding in the majority of
cases but UTF-8 and the language encoding flag for filenames
that cannot be encoded using the specified encoding.</p>
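<p>Whether a name "cannot be encoded using the specified
encoding" can be checked with the JDK's
<code>CharsetEncoder</code>. The sketch below shows the kind of
per-name decision such a fallback implies. It is an assumption
about the general approach rather than the library's actual
code, and the class and method names are hypothetical. ISO-8859-1
stands in for the configured encoding because every Java runtime
provides it. CodePage 437 would be
<code>Charset.forName("IBM437")</code> where available.</p>
<source><![CDATA[
```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

class NameEncodingSketch {
    /** Picks the configured charset if it can represent the name, otherwise UTF-8. */
    static Charset charsetFor(String name, Charset configured) {
        return configured.newEncoder().canEncode(name) ? configured : StandardCharsets.UTF_8;
    }

    public static void main(String[] args) {
        Charset configured = StandardCharsets.ISO_8859_1;
        // Unicode escapes keep this source file ASCII-only.
        String western = "r\u00e9sum\u00e9.txt";       // accented Latin letters
        String japanese = "\u65e5\u672c\u8a9e.txt";    // three kanji characters
        System.out.println(charsetFor(western, configured));  // prints ISO-8859-1
        System.out.println(charsetFor(japanese, configured)); // prints UTF-8
    }
}
```
]]></source>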
<p>The <code>ZipFile</code>
and <code>ZipArchiveInputStream</code> classes recognize the
Unicode extra fields by default and read the file name
information from them, unless you set the constructor parameter
<code>scanForUnicodeExtraFields</code> to false.</p>
<h4>Recommendations for Interoperability</h4>
<p>The optimal setting of flags depends on the archivers you
expect as consumers/producers of the ZIP archives. Below
are some test results which may be superseded with later
versions of each tool.</p>
<ul>
<li>Prior to Java7 the java.util.zip package used by the jar executable or
to read jars from your CLASSPATH reads and writes UTF-8
names; it doesn't set or recognize any flags or Unicode
extra fields.</li>
<li>Starting with Java7 <code>java.util.zip</code> writes
UTF-8 by default and uses the language encoding flag. It
is possible to specify a different encoding when
reading/writing ZIPs via new constructors. The package
now recognizes the language encoding flag when reading and
ignores the Unicode extra fields.</li>
<li>7Zip writes CodePage 437 by default but uses UTF-8 and
the language encoding flag when writing entries that
cannot be encoded as CodePage 437 (similar to the zip task
with fallbacktoUTF8 set to true). It recognizes the
language encoding flag when reading and ignores the
Unicode extra fields.</li>
<li>WinZIP writes CodePage 437 and uses Unicode extra fields
by default. It recognizes the Unicode extra field and the
language encoding flag when reading.</li>
<li>Windows' "compressed folder" feature doesn't recognize
any flag or extra field and creates archives using the
platform's default encoding - and expects archives to be in
that encoding when reading them.</li>
<li>InfoZIP based tools can recognize and write both; it is
a compile time option and depends on the platform so your
mileage may vary.</li>
<li>PKWARE zip tools recognize both and prefer the language
encoding flag. They create archives using CodePage 437 if
possible and UTF-8 plus the language encoding flag for
file names that cannot be encoded as CodePage 437.</li>
</ul>
<p>So, what to do?</p>
<p>If you are creating jars, then java.util.zip is your main
consumer. We recommend you set the encoding to UTF-8 and
keep the language encoding flag enabled. The flag won't
help or hurt java.util.zip prior to Java7 but archivers that
support it will show the correct file names.</p>
<p>For maximum interop it is probably best to set the encoding
to UTF-8, enable the language encoding flag and create
Unicode extra fields when writing ZIPs. Such archives
should be extracted correctly by java.util.zip, 7Zip,
WinZIP, PKWARE tools and most likely InfoZIP tools. They
will be unusable with Windows' "compressed folders" feature
and bigger than archives without the Unicode extra fields,
though.</p>
<p>If Windows' "compressed folders" is your primary consumer,
then your best option is to explicitly set the encoding to
the target platform. You may want to enable creation of
Unicode extra fields so the tools that support them will
extract the file names correctly.</p>
</subsection>
<subsection name="Encryption and Alternative Compression Algorithms"
id="encryption">
<p>In most cases entries of an archive are not encrypted and
are either not compressed at all or use the DEFLATE
algorithm. Commons Compress' ZIP archiver will handle them
just fine. As of version 1.7, Commons Compress can also
decompress entries compressed with the legacy SHRINK and
IMPLODE algorithms of PKZIP 1.x. Version 1.11 of Commons
Compress adds read-only support for BZIP2. Version 1.16 adds
read-only support for DEFLATE64 - also known as "enhanced DEFLATE".</p>
<p>The ZIP specification allows for various other compression
algorithms and also supports several different ways of
encrypting archive contents. Neither of those methods is
currently supported by Commons Compress and any such entry
cannot be extracted by the archiving code.</p>
<p><code>ZipFile</code>'s and
<code>ZipArchiveInputStream</code>'s
<code>canReadEntryData</code> methods will return false for
encrypted entries or entries using an unsupported compression
mechanism. Using these methods it is possible to at least
detect and skip the entries that cannot be extracted.</p>
<table>
<thead>
<tr>
<th>Version of Apache Commons Compress</th>
<th>Supported Compression Methods</th>
<th>Supported Encryption Methods</th>
</tr>
</thead>
<tbody>
<tr>
<td>1.0 to 1.6</td>
<td>STORED, DEFLATE</td>
<td>-</td>
</tr>
<tr>
<td>1.7 to 1.10</td>
<td>STORED, DEFLATE, SHRINK, IMPLODE</td>
<td>-</td>
</tr>
<tr>
<td>1.11 to 1.15</td>
<td>STORED, DEFLATE, SHRINK, IMPLODE, BZIP2</td>
<td>-</td>
</tr>
<tr>
<td>1.16 and later</td>
<td>STORED, DEFLATE, SHRINK, IMPLODE, BZIP2, DEFLATE64
(enhanced deflate)</td>
<td>-</td>
</tr>
</tbody>
</table>
</subsection>
<subsection name="Zip64 Support" id="zip64">
<p>The traditional ZIP format is limited to archive sizes of
four gibibytes (actually 2<sup>32</sup> - 1 bytes &#x2248;
4.3 GB) and 65535 entries, where each individual entry is
limited to four gibibytes as well. These limits seemed
excessive in the 1980s.</p>
<p>Version 4.5 of the ZIP specification introduced the so-called
"Zip64 extensions" to raise those limits: compressed or
uncompressed sizes of up to 16 exbibytes
(actually 2<sup>64</sup> - 1 bytes &#x2248; 18.5 EB, i.e.
18.5 x 10<sup>18</sup> bytes) in archives that themselves
can take up to 16 exbibytes and contain more than
18 x 10<sup>18</sup> entries.</p>
<p>Apache Commons Compress 1.2 and earlier do not support
Zip64 extensions at all.</p>
<p>Starting with Apache Commons Compress
1.3 <code>ZipArchiveInputStream</code>
and <code>ZipFile</code> transparently support Zip64
extensions. By default <code>ZipArchiveOutputStream</code>
supports them transparently as well (i.e. it adds Zip64
extensions if needed and doesn't use them for
entries/archives that don't need them) if the compressed and
uncompressed sizes of the entry are known
when <code>putArchiveEntry</code> is called
or <code>ZipArchiveOutputStream</code>
uses <code>SeekableByteChannel</code>
(see <a href="#ZipArchiveOutputStream">above</a>). If only
the uncompressed size is
known <code>ZipArchiveOutputStream</code> will assume the
compressed size will not be bigger than the uncompressed
size.</p>
<p><code>ZipArchiveOutputStream</code>'s
<code>setUseZip64</code> can be used to control the behavior.
<code>Zip64Mode.AsNeeded</code> is the default behavior
described in the previous paragraph.</p>
<p>If <code>ZipArchiveOutputStream</code> is writing to a
non-seekable stream it has to decide whether to use Zip64
extensions or not before it starts writing the entry data.
This means that if the size of the entry is unknown
when <code>putArchiveEntry</code> is called it doesn't have
anything to base the decision on. By default it will not
use Zip64 extensions in order to create archives that can be
extracted by older archivers (it will later throw an
exception in <code>closeEntry</code> if it detects Zip64
extensions had been needed). It is possible to
instruct <code>ZipArchiveOutputStream</code> to always
create Zip64 extensions by using
the <code>setUseZip64</code> with an argument
of <code>Zip64Mode.Always</code>; use this if you are
writing entries of unknown size to a stream and expect some
of them to be too big to fit into the traditional
limits.</p>
<p><code>Zip64Mode.Always</code> creates archives that use
Zip64 extensions for all entries, even those that don't
require them. Such archives will be slightly bigger than
archives created with one of the other modes and not be
readable by unarchivers that don't support Zip64
extensions.</p>
<p><code>Zip64Mode.Never</code> will not use any Zip64
extensions at all and may lead to
a <code>Zip64RequiredException</code> being thrown
if <code>ZipArchiveOutputStream</code> detects that one of
the format's limits is exceeded. Archives created in this
mode will be readable by all unarchivers; they may be
slightly smaller than archives created
with <code>SeekableByteChannel</code>
in <code>Zip64Mode.AsNeeded</code> mode if some of the
entries had unknown sizes.</p>
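<p>The behavior of the three modes described above can be
condensed into a small decision function. The sketch below is a
paraphrase of the prose, not the library's implementation: the
<code>Mode</code> enum is a hypothetical stand-in
for <code>Zip64Mode</code>, and the assumption that a seekable
target reserves room for Zip64 data when the size is unknown is
inferred from the remark about archive sizes in the previous
paragraph.</p>
<source><![CDATA[
```java
class Zip64DecisionSketch {
    /** Hypothetical stand-in for the three modes discussed above. */
    enum Mode { ALWAYS, NEVER, AS_NEEDED }

    static final long SIZE_UNKNOWN = -1L;
    // 2^32 - 1; the all-ones value also serves as the "see the Zip64 field" marker,
    // so a size equal to it already requires Zip64.
    static final long ZIP64_THRESHOLD = 0xFFFFFFFFL;

    /** Whether Zip64 structures are written up front for an entry. */
    static boolean reserveZip64(Mode mode, long size, boolean seekable) {
        switch (mode) {
            case ALWAYS:
                return true;
            case NEVER:
                return false; // a writer would have to fail later if a limit is exceeded
            default:
                if (size == SIZE_UNKNOWN) {
                    // Seekable output can be patched afterwards, so room is reserved;
                    // a plain stream has nothing to base the decision on.
                    return seekable;
                }
                return size >= ZIP64_THRESHOLD;
        }
    }

    public static void main(String[] args) {
        System.out.println(reserveZip64(Mode.AS_NEEDED, SIZE_UNKNOWN, false));  // prints false
        System.out.println(reserveZip64(Mode.AS_NEEDED, SIZE_UNKNOWN, true));   // prints true
        System.out.println(reserveZip64(Mode.AS_NEEDED, 5_000_000_000L, false)); // prints true
    }
}
```
]]></source>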
<p>The <code>java.util.zip</code> package and the
<code>jar</code> command of Java5 and earlier cannot read
Zip64 extensions and will fail if the archive contains any.
So if you intend to create archives that Java5 can consume
you must set the mode to <code>Zip64Mode.Never</code>.</p>
<h4>Known Limitations</h4>
<p>Some of the theoretical limits of the format are not
reached because of Apache Commons Compress' own API
(<code>ArchiveEntry</code>'s size information uses
a <code>long</code>) or its usage of Java collections
or <code>SeekableByteChannel</code> internally. The table
below shows the theoretical limits supported by Apache
Commons Compress. In practice it is very likely that you'd
run out of memory or your file system won't allow files that
big long before you reach either limit.</p>
<table>
<thead>
<tr>
<th/>
<th>Max. Size of Archive</th>
<th>Max. Compressed/Uncompressed Size of Entry</th>
<th>Max. Number of Entries</th>
</tr>
</thead>
<tbody>
<tr>
<td>ZIP Format Without Zip 64 Extensions</td>
<td>2<sup>32</sup> - 1 bytes &#x2248; 4.3 GB</td>
<td>2<sup>32</sup> - 1 bytes &#x2248; 4.3 GB</td>
<td>65535</td>
</tr>
<tr>
<td>ZIP Format using Zip 64 Extensions</td>
<td>2<sup>64</sup> - 1 bytes &#x2248; 18.5 EB</td>
<td>2<sup>64</sup> - 1 bytes &#x2248; 18.5 EB</td>
<td>2<sup>64</sup> - 1 &#x2248; 18.5 x 10<sup>18</sup></td>
</tr>
<tr>
<td>Commons Compress 1.2 and earlier</td>
<td>unlimited in <code>ZipArchiveInputStream</code>
and <code>ZipArchiveOutputStream</code> and
2<sup>32</sup> - 1 bytes &#x2248; 4.3 GB
in <code>ZipFile</code>.</td>
<td>2<sup>32</sup> - 1 bytes &#x2248; 4.3 GB</td>
<td>unlimited in <code>ZipArchiveInputStream</code>,
65535 in <code>ZipArchiveOutputStream</code>
and <code>ZipFile</code>.</td>
</tr>
<tr>
<td>Commons Compress 1.3 and later</td>
<td>unlimited in <code>ZipArchiveInputStream</code>
and <code>ZipArchiveOutputStream</code> and
2<sup>63</sup> - 1 bytes &#x2248; 9.2 EB
in <code>ZipFile</code>.</td>
<td>2<sup>63</sup> - 1 bytes &#x2248; 9.2 EB</td>
<td>unlimited in <code>ZipArchiveInputStream</code>,
2<sup>31</sup> - 1 &#x2248; 2.1 billion
in <code>ZipArchiveOutputStream</code>
and <code>ZipFile</code>.</td>
</tr>
</tbody>
</table>
<h4>Known Interoperability Problems</h4>
<p>The <code>java.util.zip</code> package of OpenJDK7 supports
Zip 64 extensions but its <code>ZipInputStream</code> and
<code>ZipFile</code> classes will be unable to extract
archives created with Commons Compress 1.3's
<code>ZipArchiveOutputStream</code> if the archive contains
entries that use the data descriptor, are smaller than 4 GiB
and have Zip 64 extensions enabled. In other words, the classes in
OpenJDK currently only support archives that use Zip 64
extensions when they are actually needed. These classes
are used to load JAR files and are the base for the
<code>jar</code> command line utility as well.</p>
</subsection>
<subsection name="Consuming Archives Completely">
<p>Prior to version 1.5 <code>ZipArchiveInputStream</code>
would return null from <code>getNextEntry</code> or
<code>getNextZipEntry</code> as soon as the first central
directory header of the archive was found, leaving the whole
central directory itself unread inside the stream. Starting
with version 1.5 <code>ZipArchiveInputStream</code> will try
to read the archive up to and including the "end of central
directory" record effectively consuming the archive
completely.</p>
</subsection>
<subsection name="Symbolic Links" id="symlinks">
<p>Starting with Compress 1.5 <code>ZipArchiveEntry</code>
recognizes Unix Symbolic Link entries written by InfoZIP's
zip.</p>
<p>The <code>ZipFile</code> class contains a convenience
method to read the link name of an entry. Basically all it
does is read the contents of the entry and convert it to
a string using the given file name encoding of the
archive.</p>
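<p>Both steps are simple enough to sketch in plain Java: a
Unix mode denotes a symbolic link when its file type bits equal
<code>0120000</code>, and the link target is the entry's content
decoded as text. The class and method names below are
hypothetical and only mirror what the paragraph describes.</p>
<source><![CDATA[
```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

class SymlinkSketch {
    private static final int FILE_TYPE_MASK = 0170000;
    private static final int SYMLINK_TYPE = 0120000;

    /** True if the file type bits of the Unix mode mark a symbolic link. */
    static boolean isSymlink(int unixMode) {
        return (unixMode & FILE_TYPE_MASK) == SYMLINK_TYPE;
    }

    /** The link target is the entry's content decoded with the file name encoding. */
    static String linkTarget(byte[] entryContent, Charset fileNameEncoding) {
        return new String(entryContent, fileNameEncoding);
    }

    public static void main(String[] args) {
        // Hypothetical entry: mode lrwxrwxrwx, content is the path the link points to.
        byte[] content = "../lib/libfoo.so.1".getBytes(StandardCharsets.UTF_8);
        if (isSymlink(0120777)) {
            System.out.println(linkTarget(content, StandardCharsets.UTF_8));
        }
    }
}
```
]]></source>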
</subsection>
<subsection name="Parallel zip creation" id="parallel">
<p>Starting with Compress 1.10 there is built-in support for
parallel creation of zip archives.</p>
<p>Multiple threads can write
to their own <code>ScatterZipOutputStream</code>
instance that is backed by a file or by some user-implemented form of
storage (implementing <code>ScatterGatherBackingStore</code>).</p>
<p>When the threads finish, they can join these streams together
to a complete zip file using the <code>writeTo</code> method
that will write a single <code>ScatterZipOutputStream</code> to a target
<code>ZipArchiveOutputStream</code>.</p>
<p>To assist this process, clients can use
<code>ParallelScatterZipCreator</code> that will handle thread
pools and correct memory model consistency so the client
can avoid these issues. Please note that when writing well-formed
Zip files this way, it is usually necessary to keep a
separate <code>ScatterZipOutputStream</code> that receives all directories
and writes this to the target <code>ZipArchiveOutputStream</code> before
the ones created through <code>ParallelScatterZipCreator</code>. This is the responsibility of the client.</p>
<p>There is no guarantee of order of the entries when writing a Zip
file with <code>ParallelScatterZipCreator</code>.</p>
<p>See the examples section for a code sample demonstrating how to make a zip file.</p>
</subsection>
</section>
</body>
</document>