12-F.21: File Compression
Data Compression
Data compression is the process of encoding information using fewer bits than the original representation. Any particular compression is either lossy or lossless. Lossless compression reduces bits by identifying and eliminating statistical redundancy. No information is lost in lossless compression. Lossy compression reduces bits by removing unnecessary or less important information. Typically, a device that performs data compression is referred to as an encoder, and one that performs the reversal of the process (decompression) as a decoder.
The process of reducing the size of a data file is often referred to as data compression. In the context of data transmission, it is called source coding: encoding done at the source of the data before it is stored or transmitted. Source coding should not be confused with channel coding, which is used for error detection and correction, or with line coding, which is the means of mapping data onto a signal.
The gzip Command
The gzip command compresses files. Each input file is compressed into its own output file. The compressed file consists of a GNU zip header and deflated data.
If given a file as an argument, gzip compresses the file, adds a “.gz” suffix, and deletes the original file. With no arguments, gzip compresses the standard input and writes the compressed file to standard output.
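The two behaviours described above can be tried with a short sketch. The file names (notes.txt, from_stdin.gz, copy.gz) and the temporary directory are made up for illustration; -c (write to standard output) and -d (decompress) are standard gzip options.

```shell
# Work in a throwaway directory so nothing important is touched.
workdir=$(mktemp -d)
cd "$workdir"
seq 1 200 > notes.txt          # hypothetical sample file

# File argument: notes.txt is replaced by notes.txt.gz.
gzip notes.txt
ls

# No file argument: read standard input, write compressed data to standard output.
seq 1 200 | gzip > from_stdin.gz

# -c writes to standard output and leaves the input file in place.
gzip -dc notes.txt.gz > notes.txt    # restore an uncompressed copy first
gzip -c notes.txt > copy.gz
ls
```

Redirecting the output of -c is a common way to keep the original file when compressing.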
Difference between gzip and zip commands in Unix and when to use which command
- ZIP and GZIP are two very popular methods of compressing files, in order to save space or to reduce the amount of time needed to transmit the files across the network, or internet.
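- When many files are bundled together, GZIP (combined with tar) generally produces a smaller archive than ZIP. Both rely on the same deflate algorithm, so the difference comes from how the files are grouped and not from a better compressor.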
- The common practice with GZIP is to archive all the files into a single tarball before compression. In ZIP files, the individual files are compressed and then added to the archive.
- When you want to pull a single file from a ZIP, it is simply located in the archive and decompressed on its own. With GZIP, the compressed tarball has to be decompressed from the beginning until the file you want is reached.
- When pulling a 1 MB file from a 10 GB archive, this takes a lot longer with GZIP than with ZIP.
- GZIP’s disadvantage in how it operates is also responsible for GZIP’s advantage. Since the compression algorithm in GZIP compresses one large file instead of multiple smaller ones, it can take advantage of the redundancy in the files to reduce the file size even further.
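- If you archive and compress 10 identical files with ZIP and GZIP, the ZIP file can be many times bigger than the resulting GZIP file. The gain is largest for small files, because deflate only finds repeated data within a 32 KB sliding window.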
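The comparison in the list above can be reproduced with a rough sketch. It assumes zip, unzip, tar, and gzip are installed. The file names (file1.txt to file10.txt, bundle.zip, bundle.tar.gz) are hypothetical, and the exact sizes will vary from system to system.

```shell
workdir=$(mktemp -d)
cd "$workdir"

# Ten identical small files (hypothetical names).
seq 1 300 > file1.txt
for i in 2 3 4 5 6 7 8 9 10; do cp file1.txt "file$i.txt"; done

# ZIP: each file is compressed on its own, then added to the archive.
zip -q bundle.zip file*.txt

# tar + gzip: archive first, then compress the whole tarball as one stream.
tar -czf bundle.tar.gz file*.txt

# Compare the archive sizes in bytes.
wc -c bundle.zip bundle.tar.gz

# Pull a single member out of each archive.
mkdir from_zip from_tar
unzip -q bundle.zip file3.txt -d from_zip
tar -xzf bundle.tar.gz -C from_tar file3.txt
```

With files this small the tarball benefits from the repeated content, while the ZIP archive pays the compression cost and header overhead once per file.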
Syntax:
gzip [ OPTIONS ] [filenames]
Command Options:
Options | Option Meaning |
---|---|
-d | Reverse file compression (decompression). |
-r | Enable directory recursion during compression or decompression. |
-t | Perform an integrity check on the compressed file. |
-v | Display the name and percentage reduction of the compressed or decompressed file. |
In the following example, the .txt file is compressed; notice that the original file is no longer present and only the compressed copy remains.
pbmac@pbmac-server $ ls myFile.txt
myFile.txt
pbmac@pbmac-server $ gzip myFile.txt
pbmac@pbmac-server $ ls myFile.txt*
myFile.txt.gz
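The options from the table can be combined in a short session. This is a sketch only: the directory logs and the file names other than myFile.txt are invented for the example.

```shell
workdir=$(mktemp -d)
cd "$workdir"
seq 1 500 > myFile.txt
cp myFile.txt original.txt      # reference copy for a later comparison

gzip -v myFile.txt              # -v: report the percentage reduction
gzip -t myFile.txt.gz && echo "integrity check passed"   # -t: silent on success
gzip -d myFile.txt.gz           # -d: decompress, restoring myFile.txt
cmp myFile.txt original.txt && echo "contents unchanged"

# -r: recurse into a directory and compress every file inside it.
mkdir -p logs/old
seq 1 100 > logs/a.log
seq 1 100 > logs/old/b.log
gzip -r logs
find logs -type f
```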
The xz Utilities
XZ Utils is a set of free command-line lossless data compressors, including lzma and xz, for Unix-like operating systems such as Linux.
xz typically achieves higher compression ratios than alternatives like gzip and bzip2. Decompression is faster than bzip2 but slower than gzip. Compression can be much slower than gzip, and is slower than bzip2 at high compression levels, so xz is most useful when a file is compressed once and then used many times.
XZ Utils consists of two major components:
- xz, the command-line compressor and decompressor (analogous to gzip)
- liblzma, a software library with an API similar to zlib
Various command shortcuts exist, such as lzma (for xz --format=lzma), unxz (for xz --decompress; analogous to gunzip) and xzcat (for unxz --stdout; analogous to zcat).
XZ Utils can compress and decompress both the xz and lzma file formats, but since the lzma format is now legacy, XZ Utils compresses by default to xz.
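A brief sketch of the xz commands mentioned above, assuming XZ Utils is installed. The file name data.txt is hypothetical; -k (keep the input file), -t (test), and -c (write to standard output) are standard xz options.

```shell
workdir=$(mktemp -d)
cd "$workdir"
seq 1 1000 > data.txt

xz -k data.txt                      # compress to data.txt.xz; -k keeps the original
xz -t data.txt.xz                   # test the integrity of the compressed file
xzcat data.txt.xz | tail -n 3       # decompress to standard output

# Write the legacy .lzma format instead of the default .xz format.
xz --format=lzma -c data.txt > data.txt.lzma

rm data.txt
unxz data.txt.xz                    # same as xz --decompress; restores data.txt
```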
The bzip2 Command
The bzip2 command compresses and decompresses individual files so that they take less storage space than the originals. Unlike zip, it does not bundle several files into one archive. Compared with gzip, it is slower to compress and decompress and uses more memory, but it usually produces smaller files. It uses the Burrows-Wheeler block-sorting text compression algorithm and Huffman coding. Each file is replaced by a compressed version of itself, named after the original file with the extension .bz2 appended.
bzip2 is actually part of a larger suite of commands all based around this compression algorithm. The other commands that are a part of this suite are listed in the table below:
Command | Description |
---|---|
bzip2 | Compress a file. |
bunzip2 | Decompress a file. |
bzcat | Decompress a file to standard output. |
bzdiff | Run the diff command on compressed files. |
bzip2recover | Recover data from damaged .bz2 files. |
bzless | Run the less command on compressed files. |
bzmore | Run the more command on compressed files. |
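Several of the commands in the table can be tried together. This is a minimal sketch that assumes the bzip2 suite is installed; the report file names are made up for illustration.

```shell
workdir=$(mktemp -d)
cd "$workdir"
seq 1 1000 > report.txt
cp report.txt report_v2.txt
echo "extra line" >> report_v2.txt

bzip2 report.txt report_v2.txt      # each file is replaced by its own .bz2 file
ls
bzcat report.txt.bz2 | tail -n 2    # view the contents without creating a file

# Compare two compressed files; like diff, bzdiff exits non-zero when they differ.
bzdiff report.txt.bz2 report_v2.txt.bz2 || true

bunzip2 report.txt.bz2              # restore report.txt
```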
The zip Command
The zip command is a compression and file packaging utility for Unix. The files are stored in a single archive file with the extension .zip.
- zip compresses files to reduce their size and also serves as a file packaging utility. zip is available on many operating systems, such as Unix, Linux, and Windows.
- If you have limited bandwidth between two servers and want to transfer files faster, zip the files first and then transfer the archive.
- The zip program puts one or more compressed files into a single zip archive, along with information about the files (name, path, date, time of last modification, protection, and check information to verify file integrity). An entire directory structure can be packed into a zip archive with a single command.
- Compression ratios of 2:1 to 3:1 are common for text files. zip has one compression method (deflation) and can also store files without compression. zip automatically chooses the better of the two for each file to be compressed.
The program is useful for packaging a set of files for distribution, for archiving files, and for saving disk space by temporarily compressing unused files or directories.
Syntax:
zip [ OPTIONS ] zipfile files_list
Command Options:
Options | Option Meaning |
---|---|
-d, --delete | Remove (delete) entries from a zip archive. |
-e, --encrypt | Encrypt the contents of the zip archive using a password that is entered on the terminal. |
-F, --fix, -FF, --fixfix | Fix the zip archive. The -F option can be used if some portions of the archive are missing, but requires a reasonably intact central directory. |
-r, --recurse-paths | Traverse the directory structure recursively. |
To create a zip file you specify the zip file name, followed by the files to include in the zip file:
pbmac@pbmac-server $ zip myfile.zip filename.txt otherfile.txt picture.jpg
Extracting files from zip file
The unzip command will list, test, or extract files from a zip archive. The default behavior (with no options) is to extract into the current directory (and sub-directories below it) all files from the specified zip archive.
pbmac@pbmac-server $ unzip myfile.zip
Adapted from:
"Data compression" by Multiple Contributors, Wikipedia is licensed under CC BY-SA 3.0
"Gzip Command in Linux" by Shubrodeep Banerjee, Geeks for Geeks is licensed under CC BY-SA 4.0
"XZ Utils" by Multiple Contributors, Wikipedia is licensed under CC BY-SA 3.0
"bzip2 command in Linux with Examples" by sarthak_ishu11, Geeks for Geeks is licensed under CC BY-SA 4.0
"ZIP command in Linux with examples" by Mohak Agrawal, Geeks for Geeks is licensed under CC BY-SA 4.0