| |
| =head1 NAME |
| |
| IO::Compress::FAQ -- Frequently Asked Questions about IO::Compress |
| |
| =head1 DESCRIPTION |
| |
| Common questions answered. |
| |
| =head1 GENERAL |
| |
| =head2 Compatibility with Unix compress/uncompress. |
| |
| Although C<Compress::Zlib> has a pair of functions called C<compress> and |
| C<uncompress>, they are I<not> related to the Unix programs of the same |
| name. The C<Compress::Zlib> module is not compatible with Unix |
| C<compress>. |
| |
| If you have the C<uncompress> program available, you can use this to read |
| compressed files |
| |
| open F, "uncompress -c $filename |"; |
| while (<F>) |
| { |
| ... |
| |
| Alternatively, if you have the C<gunzip> program available, you can use |
| this to read compressed files |
| |
| open F, "gunzip -c $filename |"; |
| while (<F>) |
| { |
| ... |
| |
| and this to write compress files, if you have the C<compress> program |
| available |
| |
| open F, "| compress -c $filename "; |
| print F "data"; |
| ... |
| close F ; |
| |
| =head2 Accessing .tar.Z files |
| |
| The C<Archive::Tar> module can optionally use C<Compress::Zlib> (via the |
| C<IO::Zlib> module) to access tar files that have been compressed with |
| C<gzip>. Unfortunately tar files compressed with the Unix C<compress> |
| utility cannot be read by C<Compress::Zlib> and so cannot be directly |
| accessed by C<Archive::Tar>. |
| |
| If the C<uncompress> or C<gunzip> programs are available, you can use one |
| of these workarounds to read C<.tar.Z> files from C<Archive::Tar> |
| |
| Firstly with C<uncompress> |
| |
| use strict; |
| use warnings; |
| use Archive::Tar; |
| |
| open F, "uncompress -c $filename |"; |
| my $tar = Archive::Tar->new(*F); |
| ... |
| |
| and this with C<gunzip> |
| |
| use strict; |
| use warnings; |
| use Archive::Tar; |
| |
| open F, "gunzip -c $filename |"; |
| my $tar = Archive::Tar->new(*F); |
| ... |
| |
| Similarly, if the C<compress> program is available, you can use this to |
| write a C<.tar.Z> file |
| |
| use strict; |
| use warnings; |
| use Archive::Tar; |
| use IO::File; |
| |
| my $fh = new IO::File "| compress -c >$filename"; |
| my $tar = Archive::Tar->new(); |
| ... |
| $tar->write($fh); |
| $fh->close ; |
| |
| =head2 How do I recompress using a different compression? |
| |
| This is easier that you might expect if you realise that all the |
| C<IO::Compress::*> objects are derived from C<IO::File> and that all the |
| C<IO::Uncompress::*> modules can read from an C<IO::File> filehandle. |
| |
| So, for example, say you have a file compressed with gzip that you want to |
| recompress with bzip2. Here is all that is needed to carry out the |
| recompression. |
| |
| use IO::Uncompress::Gunzip ':all'; |
| use IO::Compress::Bzip2 ':all'; |
| |
| my $gzipFile = "somefile.gz"; |
| my $bzipFile = "somefile.bz2"; |
| |
| my $gunzip = new IO::Uncompress::Gunzip $gzipFile |
| or die "Cannot gunzip $gzipFile: $GunzipError\n" ; |
| |
| bzip2 $gunzip => $bzipFile |
| or die "Cannot bzip2 to $bzipFile: $Bzip2Error\n" ; |
| |
| Note, there is a limitation of this technique. Some compression file |
| formats store extra information along with the compressed data payload. For |
| example, gzip can optionally store the original filename and Zip stores a |
| lot of information about the original file. If the original compressed file |
| contains any of this extra information, it will not be transferred to the |
| new compressed file usign the technique above. |
| |
| =head1 ZIP |
| |
| =head2 What Compression Types do IO::Compress::Zip & IO::Uncompress::Unzip support? |
| |
| The following compression formats are supported by C<IO::Compress::Zip> and |
| C<IO::Uncompress::Unzip> |
| |
| =over 5 |
| |
| =item * Store (method 0) |
| |
| No compression at all. |
| |
| =item * Deflate (method 8) |
| |
| This is the default compression used when creating a zip file with |
| C<IO::Compress::Zip>. |
| |
| =item * Bzip2 (method 12) |
| |
| Only supported if the C<IO-Compress-Bzip2> module is installed. |
| |
| =item * Lzma (method 14) |
| |
| Only supported if the C<IO-Compress-Lzma> module is installed. |
| |
| =back |
| |
| =head2 Can I Read/Write Zip files larger the 4 Gig? |
| |
| Yes, both the C<IO-Compress-Zip> and C<IO-Uncompress-Unzip> modules |
| support the zip feature called I<Zip64>. That allows them to read/write |
| files/buffers larger than 4Gig. |
| |
| If you are creating a Zip file using the one-shot interface, and any of the |
| input files is greater than 4Gig, a zip64 complaint zip file will be |
| created. |
| |
| zip "really-large-file" => "my.zip"; |
| |
| Similarly with the one-shot interface, if the input is a buffer larger than |
| 4 Gig, a zip64 complaint zip file will be created. |
| |
| zip \$really_large_buffer => "my.zip"; |
| |
| The one-shot interface allows you to force the creation of a zip64 zip file |
| by including the C<Zip64> option. |
| |
| zip $filehandle => "my.zip", Zip64 => 1; |
| |
| If you want to create a zip64 zip file with the OO interface you must |
| specify the C<Zip64> option. |
| |
| my $zip = new IO::Compress::Zip "whatever", Zip64 => 1; |
| |
| When uncompressing with C<IO-Uncompress-Unzip>, it will automatically |
| detect if the zip file is zip64. |
| |
| If you intend to manipulate the Zip64 zip files created with |
| C<IO-Compress-Zip> using an external zip/unzip, make sure that it supports |
| Zip64. |
| |
| In particular, if you are using Info-Zip you need to have zip version 3.x |
| or better to update a Zip64 archive and unzip version 6.x to read a zip64 |
| archive. |
| |
| =head2 Zip Resources |
| |
| The primary reference for zip files is the "appnote" document available at |
| L<http://www.pkware.com/documents/casestudies/APPNOTE.TXT> |
| |
| An alternatively is the Info-Zip appnote. This is available from |
| L<ftp://ftp.info-zip.org/pub/infozip/doc/> |
| |
| =head1 GZIP |
| |
| =head2 Gzip Resources |
| |
| The primary reference for gzip files is RFC 1952 |
| L<http://www.faqs.org/rfcs/rfc1952.html> |
| |
| The primary site for gzip is F<http://www.gzip.org>. |
| |
| =head1 ZLIB |
| |
| =head2 Zlib Resources |
| |
| The primary site for the I<zlib> compression library is |
| F<http://www.zlib.org>. |
| |
| =head1 HTTP & NETWORK |
| |
| =head2 Apache::GZip Revisited |
| |
| Below is a mod_perl Apache compression module, called C<Apache::GZip>, |
| taken from |
| F<http://perl.apache.org/docs/tutorials/tips/mod_perl_tricks/mod_perl_tricks.html#On_the_Fly_Compression> |
| |
| package Apache::GZip; |
| #File: Apache::GZip.pm |
| |
| use strict vars; |
| use Apache::Constants ':common'; |
| use Compress::Zlib; |
| use IO::File; |
| use constant GZIP_MAGIC => 0x1f8b; |
| use constant OS_MAGIC => 0x03; |
| |
| sub handler { |
| my $r = shift; |
| my ($fh,$gz); |
| my $file = $r->filename; |
| return DECLINED unless $fh=IO::File->new($file); |
| $r->header_out('Content-Encoding'=>'gzip'); |
| $r->send_http_header; |
| return OK if $r->header_only; |
| |
| tie *STDOUT,'Apache::GZip',$r; |
| print($_) while <$fh>; |
| untie *STDOUT; |
| return OK; |
| } |
| |
| sub TIEHANDLE { |
| my($class,$r) = @_; |
| # initialize a deflation stream |
| my $d = deflateInit(-WindowBits=>-MAX_WBITS()) || return undef; |
| |
| # gzip header -- don't ask how I found out |
| $r->print(pack("nccVcc",GZIP_MAGIC,Z_DEFLATED,0,time(),0,OS_MAGIC)); |
| |
| return bless { r => $r, |
| crc => crc32(undef), |
| d => $d, |
| l => 0 |
| },$class; |
| } |
| |
| sub PRINT { |
| my $self = shift; |
| foreach (@_) { |
| # deflate the data |
| my $data = $self->{d}->deflate($_); |
| $self->{r}->print($data); |
| # keep track of its length and crc |
| $self->{l} += length($_); |
| $self->{crc} = crc32($_,$self->{crc}); |
| } |
| } |
| |
| sub DESTROY { |
| my $self = shift; |
| |
| # flush the output buffers |
| my $data = $self->{d}->flush; |
| $self->{r}->print($data); |
| |
| # print the CRC and the total length (uncompressed) |
| $self->{r}->print(pack("LL",@{$self}{qw/crc l/})); |
| } |
| |
| 1; |
| |
| Here's the Apache configuration entry you'll need to make use of it. Once |
| set it will result in everything in the /compressed directory will be |
| compressed automagically. |
| |
| <Location /compressed> |
| SetHandler perl-script |
| PerlHandler Apache::GZip |
| </Location> |
| |
| Although at first sight there seems to be quite a lot going on in |
| C<Apache::GZip>, you could sum up what the code was doing as follows -- |
| read the contents of the file in C<< $r->filename >>, compress it and write |
| the compressed data to standard output. That's all. |
| |
| This code has to jump through a few hoops to achieve this because |
| |
| =over |
| |
| =item 1. |
| |
| The gzip support in C<Compress::Zlib> version 1.x can only work with a real |
| filesystem filehandle. The filehandles used by Apache modules are not |
| associated with the filesystem. |
| |
| =item 2. |
| |
| That means all the gzip support has to be done by hand - in this case by |
| creating a tied filehandle to deal with creating the gzip header and |
| trailer. |
| |
| =back |
| |
| C<IO::Compress::Gzip> doesn't have that filehandle limitation (this was one |
| of the reasons for writing it in the first place). So if |
| C<IO::Compress::Gzip> is used instead of C<Compress::Zlib> the whole tied |
| filehandle code can be removed. Here is the rewritten code. |
| |
| package Apache::GZip; |
| |
| use strict vars; |
| use Apache::Constants ':common'; |
| use IO::Compress::Gzip; |
| use IO::File; |
| |
| sub handler { |
| my $r = shift; |
| my ($fh,$gz); |
| my $file = $r->filename; |
| return DECLINED unless $fh=IO::File->new($file); |
| $r->header_out('Content-Encoding'=>'gzip'); |
| $r->send_http_header; |
| return OK if $r->header_only; |
| |
| my $gz = new IO::Compress::Gzip '-', Minimal => 1 |
| or return DECLINED ; |
| |
| print $gz $_ while <$fh>; |
| |
| return OK; |
| } |
| |
| or even more succinctly, like this, using a one-shot gzip |
| |
| package Apache::GZip; |
| |
| use strict vars; |
| use Apache::Constants ':common'; |
| use IO::Compress::Gzip qw(gzip); |
| |
| sub handler { |
| my $r = shift; |
| $r->header_out('Content-Encoding'=>'gzip'); |
| $r->send_http_header; |
| return OK if $r->header_only; |
| |
| gzip $r->filename => '-', Minimal => 1 |
| or return DECLINED ; |
| |
| return OK; |
| } |
| |
| 1; |
| |
| The use of one-shot C<gzip> above just reads from C<< $r->filename >> and |
| writes the compressed data to standard output. |
| |
| Note the use of the C<Minimal> option in the code above. When using gzip |
| for Content-Encoding you should I<always> use this option. In the example |
| above it will prevent the filename being included in the gzip header and |
| make the size of the gzip data stream a slight bit smaller. |
| |
| =head2 Compressed files and Net::FTP |
| |
| The C<Net::FTP> module provides two low-level methods called C<stor> and |
| C<retr> that both return filehandles. These filehandles can used with the |
| C<IO::Compress/Uncompress> modules to compress or uncompress files read |
| from or written to an FTP Server on the fly, without having to create a |
| temporary file. |
| |
| Firstly, here is code that uses C<retr> to uncompressed a file as it is |
| read from the FTP Server. |
| |
| use Net::FTP; |
| use IO::Uncompress::Gunzip qw(:all); |
| |
| my $ftp = new Net::FTP ... |
| |
| my $retr_fh = $ftp->retr($compressed_filename); |
| gunzip $retr_fh => $outFilename, AutoClose => 1 |
| or die "Cannot uncompress '$compressed_file': $GunzipError\n"; |
| |
| and this to compress a file as it is written to the FTP Server |
| |
| use Net::FTP; |
| use IO::Compress::Gzip qw(:all); |
| |
| my $stor_fh = $ftp->stor($filename); |
| gzip "filename" => $stor_fh, AutoClose => 1 |
| or die "Cannot compress '$filename': $GzipError\n"; |
| |
| =head1 MISC |
| |
| =head2 Using C<InputLength> to uncompress data embedded in a larger file/buffer. |
| |
| A fairly common use-case is where compressed data is embedded in a larger |
| file/buffer and you want to read both. |
| |
| As an example consider the structure of a zip file. This is a well-defined |
| file format that mixes both compressed and uncompressed sections of data in |
| a single file. |
| |
| For the purposes of this discussion you can think of a zip file as sequence |
| of compressed data streams, each of which is prefixed by an uncompressed |
| local header. The local header contains information about the compressed |
| data stream, including the name of the compressed file and, in particular, |
| the length of the compressed data stream. |
| |
| To illustrate how to use C<InputLength> here is a script that walks a zip |
| file and prints out how many lines are in each compressed file (if you |
| intend write code to walking through a zip file for real see |
| L<IO::Uncompress::Unzip/"Walking through a zip file"> ). Also, although |
| this example uses the zlib-based compression, the technique can be used by |
| the other C<IO::Uncompress::*> modules. |
| |
| use strict; |
| use warnings; |
| |
| use IO::File; |
| use IO::Uncompress::RawInflate qw(:all); |
| |
| use constant ZIP_LOCAL_HDR_SIG => 0x04034b50; |
| use constant ZIP_LOCAL_HDR_LENGTH => 30; |
| |
| my $file = $ARGV[0] ; |
| |
| my $fh = new IO::File "<$file" |
| or die "Cannot open '$file': $!\n"; |
| |
| while (1) |
| { |
| my $sig; |
| my $buffer; |
| |
| my $x ; |
| ($x = $fh->read($buffer, ZIP_LOCAL_HDR_LENGTH)) == ZIP_LOCAL_HDR_LENGTH |
| or die "Truncated file: $!\n"; |
| |
| my $signature = unpack ("V", substr($buffer, 0, 4)); |
| |
| last unless $signature == ZIP_LOCAL_HDR_SIG; |
| |
| # Read Local Header |
| my $gpFlag = unpack ("v", substr($buffer, 6, 2)); |
| my $compressedMethod = unpack ("v", substr($buffer, 8, 2)); |
| my $compressedLength = unpack ("V", substr($buffer, 18, 4)); |
| my $uncompressedLength = unpack ("V", substr($buffer, 22, 4)); |
| my $filename_length = unpack ("v", substr($buffer, 26, 2)); |
| my $extra_length = unpack ("v", substr($buffer, 28, 2)); |
| |
| my $filename ; |
| $fh->read($filename, $filename_length) == $filename_length |
| or die "Truncated file\n"; |
| |
| $fh->read($buffer, $extra_length) == $extra_length |
| or die "Truncated file\n"; |
| |
| if ($compressedMethod != 8 && $compressedMethod != 0) |
| { |
| warn "Skipping file '$filename' - not deflated $compressedMethod\n"; |
| $fh->read($buffer, $compressedLength) == $compressedLength |
| or die "Truncated file\n"; |
| next; |
| } |
| |
| if ($compressedMethod == 0 && $gpFlag & 8 == 8) |
| { |
| die "Streamed Stored not supported for '$filename'\n"; |
| } |
| |
| next if $compressedLength == 0; |
| |
| # Done reading the Local Header |
| |
| my $inf = new IO::Uncompress::RawInflate $fh, |
| Transparent => 1, |
| InputLength => $compressedLength |
| or die "Cannot uncompress $file [$filename]: $RawInflateError\n" ; |
| |
| my $line_count = 0; |
| |
| while (<$inf>) |
| { |
| ++ $line_count; |
| } |
| |
| print "$filename: $line_count\n"; |
| } |
| |
| The majority of the code above is concerned with reading the zip local |
| header data. The code that I want to focus on is at the bottom. |
| |
| while (1) { |
| |
| # read local zip header data |
| # get $filename |
| # get $compressedLength |
| |
| my $inf = new IO::Uncompress::RawInflate $fh, |
| Transparent => 1, |
| InputLength => $compressedLength |
| or die "Cannot uncompress $file [$filename]: $RawInflateError\n" ; |
| |
| my $line_count = 0; |
| |
| while (<$inf>) |
| { |
| ++ $line_count; |
| } |
| |
| print "$filename: $line_count\n"; |
| } |
| |
| The call to C<IO::Uncompress::RawInflate> creates a new filehandle C<$inf> |
| that can be used to read from the parent filehandle C<$fh>, uncompressing |
| it as it goes. The use of the C<InputLength> option will guarantee that |
| I<at most> C<$compressedLength> bytes of compressed data will be read from |
| the C<$fh> filehandle (The only exception is for an error case like a |
| truncated file or a corrupt data stream). |
| |
| This means that once RawInflate is finished C<$fh> will be left at the |
| byte directly after the compressed data stream. |
| |
| Now consider what the code looks like without C<InputLength> |
| |
| while (1) { |
| |
| # read local zip header data |
| # get $filename |
| # get $compressedLength |
| |
| # read all the compressed data into $data |
| read($fh, $data, $compressedLength); |
| |
| my $inf = new IO::Uncompress::RawInflate \$data, |
| Transparent => 1, |
| or die "Cannot uncompress $file [$filename]: $RawInflateError\n" ; |
| |
| my $line_count = 0; |
| |
| while (<$inf>) |
| { |
| ++ $line_count; |
| } |
| |
| print "$filename: $line_count\n"; |
| } |
| |
| The difference here is the addition of the temporary variable C<$data>. |
| This is used to store a copy of the compressed data while it is being |
| uncompressed. |
| |
| If you know that C<$compressedLength> isn't that big then using temporary |
| storage won't be a problem. But if C<$compressedLength> is very large or |
| you are writing an application that other people will use, and so have no |
| idea how big C<$compressedLength> will be, it could be an issue. |
| |
| Using C<InputLength> avoids the use of temporary storage and means the |
| application can cope with large compressed data streams. |
| |
| One final point -- obviously C<InputLength> can only be used whenever you |
| know the length of the compressed data beforehand, like here with a zip |
| file. |
| |
| =head1 SEE ALSO |
| |
| L<Compress::Zlib>, L<IO::Compress::Gzip>, L<IO::Uncompress::Gunzip>, L<IO::Compress::Deflate>, L<IO::Uncompress::Inflate>, L<IO::Compress::RawDeflate>, L<IO::Uncompress::RawInflate>, L<IO::Compress::Bzip2>, L<IO::Uncompress::Bunzip2>, L<IO::Compress::Lzma>, L<IO::Uncompress::UnLzma>, L<IO::Compress::Xz>, L<IO::Uncompress::UnXz>, L<IO::Compress::Lzop>, L<IO::Uncompress::UnLzop>, L<IO::Compress::Lzf>, L<IO::Uncompress::UnLzf>, L<IO::Uncompress::AnyInflate>, L<IO::Uncompress::AnyUncompress> |
| |
| L<IO::Compress::FAQ|IO::Compress::FAQ> |
| |
| L<File::GlobMapper|File::GlobMapper>, L<Archive::Zip|Archive::Zip>, |
| L<Archive::Tar|Archive::Tar>, |
| L<IO::Zlib|IO::Zlib> |
| |
| =head1 AUTHOR |
| |
| This module was written by Paul Marquess, F<pmqs@cpan.org>. |
| |
| =head1 MODIFICATION HISTORY |
| |
| See the Changes file. |
| |
| =head1 COPYRIGHT AND LICENSE |
| |
| Copyright (c) 2005-2012 Paul Marquess. All rights reserved. |
| |
| This program is free software; you can redistribute it and/or |
| modify it under the same terms as Perl itself. |
| |