Archiving / Compressing Files: tar, gzip and bzip2 on Linux
December 3, 2010. Posted by Tournas Dimitrios in Linux.
This article outlines the basics of archiving and compressing files and directories on Linux, with a handful of examples to get you started with this very important concept in the Linux world.
Often, if a directory and its underlying files are not going to be used for a while, or if the entire directory tree is going to be transferred from one place to another, people convert the directory tree into an archive file. The archive contains the directory and its underlying files and subdirectories, packaged as a single file. In Linux (and Unix), the most common command for creating and extracting archives is the tar command.
Originally, archive files provided a solution to backing up disks to tape. When backing up a filesystem, the entire directory structure would be converted into a single file, which was written directly to a tape drive. The tar command derived its name from “t”ape “ar”chive.
Today, tar is seldom used to write to tapes directly. Instead, it creates archive files, which are often referred to as “tar files”, “tar archives”, or sometimes informally as “tarballs”. These archive files are conventionally given the .tar filename extension.
Archiving Files with tar :
The tar command accepts a wide variety of arguments that specify what it does to a set of files or to an archive. The main types of arguments to tar are:
|-c, --create||Create an archive file|
|-x, --extract||Extract an archive file|
|-t, --list||List the contents of an archive file|
There are others, but almost always one of these three will suffice. See the tar(1) man page for more details.
Next, almost every invocation of the tar command must include the -f command line switch and its argument, which specifies which archive file is being created, extracted, or listed.
As an example, the root user has been working on a report, which involves several subdirectories and files. He would like to email a copy of the report to a friend. Rather than attach each individual file to an email message, he decides to create an archive of the report directory. He uses the tar command, specifying -c to “c”reate an archive, and using the -f command line switch to specify the archive file to create.
report/
|-- chap1.html
|-- chap2.html
`-- figures/
    `-- image1.png
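The step described above can be sketched as follows. The scratch directory and the empty placeholder files are assumptions made only so that the example runs anywhere; the file names match the report directory shown above.

```shell
# Build a throwaway copy of the report directory (placeholder files, for illustration only).
workdir=$(mktemp -d)
cd "$workdir"
mkdir -p report/figures
touch report/chap1.html report/chap2.html report/figures/image1.png

# -c creates an archive; -f names the archive file to write.
tar -c -f report.tar report
```

Nothing is printed on success; the archive simply appears as report.tar in the current directory.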
The newly created archive file report.tar now contains the entire contents of the report directory and its subdirectories. To confirm that the archive was created correctly, the root user lists the contents of the archive file with the tar -t command (again using -f to specify which archive file).
[root@server ~]# tar -t -f report.tar
Creating archives introduces a lot of complicated questions, such as some of the following.
- When creating archives, how should links be handled? Do I archive the link, or what the link refers to?
- When extracting archives as root, do I want all of the files to be owned by root, or by the original owner? What if the original owner doesn’t exist on the system I’m unpacking the tar on?
- What happens if the tape drive I’m archiving to runs out of room in the middle of the archive?
The answers to these, and many other questions as well, can be decided with an overwhelming number of command line switches to the tar command, as tar --help or a quick look at the tar(1) man page will demonstrate. The following table lists some of the more commonly used switches, and their use will be discussed below.
|-C, --directory=DIR||Change to directory DIR|
|-P, --absolute-names||Don't strip leading / from filenames|
|-v, --verbose||List files processed|
|-z, --gzip||Internally gzip archive|
|-j, --bzip2||Internally bzip2 archive|
Suppose the root user wanted to archive a snapshot of the current networking configuration of his machine. He might run a command like the following. (Note the inclusion of the -v command line switch, which lists each file as it is processed.)
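A minimal sketch of such a command follows. On a Red Hat style system the source would be /etc/sysconfig/networking itself; here a scratch directory with made-up placeholder files stands in for it (an assumption, so that the example runs on any machine), and it is still passed to tar as an absolute path.

```shell
# Stand-in for /etc/sysconfig/networking (hypothetical placeholder files).
src=$(mktemp -d)/networking
mkdir -p "$src/profiles/default"
touch "$src/profiles/default/hosts" "$src/profiles/default/resolv.conf"

out=$(mktemp -d)
# -v lists each member as it is added; the source is given as an absolute path,
# so tar warns on stderr that it is removing the leading / from member names.
tar -c -v -f "$out/net.tar" "$src"

# Listing the archive shows that no stored name starts with a slash.
tar -t -f "$out/net.tar"
```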
As the leading message implies, what was an absolute reference to /etc/sysconfig/networking is converted to relative references inside the archive: None of the entries have leading slashes. As a rule, archive files will always unpack locally, reducing the chance that you will unintentionally clobber files in your filesystem by unpacking an archive on top of them. When constructing the archive, this behavior can be overridden with the -P command line switch.
When extracting the archive above, the first “interesting” directory is the networking directory, because it contains the relevant subdirectories and files. When the archive is extracted, however, the “extra” directories etc and etc/sysconfig are created as well. In order to get to the interesting directory, someone has to work his way down to it.
When constructing an archive, the -C command line switch can be used to help establish context by changing directory before the archive is constructed. Compare the following two tar commands.
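The comparison can be sketched as below. The scratch tree and the ifcfg-eth0 file are hypothetical stand-ins for a real /etc/sysconfig/networking, and the archive names full.tar and short.tar are made up for this illustration.

```shell
# Scratch tree standing in for /etc/sysconfig/networking (hypothetical layout).
root=$(mktemp -d)
mkdir -p "$root/etc/sysconfig/networking/devices"
touch "$root/etc/sysconfig/networking/devices/ifcfg-eth0"
cd "$root"

# Without -C: every member carries the etc/sysconfig/ prefix.
tar -c -f full.tar etc/sysconfig/networking
tar -t -f full.tar

# With -C: tar changes into etc/sysconfig first, so members start at networking/.
tar -c -f short.tar -C etc/sysconfig networking
tar -t -f short.tar
```

Extracting short.tar creates the networking directory directly, without the extra etc and etc/sysconfig levels.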
Compressing Files with gzip and bzip2 :
Files that are rarely used are often compressed. Large files are also compressed before being transferred to other systems or users. The advantages of saved space and bandwidth usually outweigh the added time it takes to compress and uncompress files.
Text files often contain patterns that can be compressed by up to 75%, but binary files rarely compress by more than 25%. In fact, it is even possible for a compressed binary file to be larger than the original file!
The two most common compression utilities used in Linux are:
- The gzip command is the most versatile and most commonly used compression utility. Files compressed with gzip are uncompressed with gunzip. Additionally, the gzip command supports the following command line switches.
|Switch||Effect|
|-c||Redirect output to stdout|
|-d||Decompress instead of compress file|
|-r||Recurse through subdirectories, compressing individual files|
|-1 … -9||Specify trade-off between CPU intensity and compression efficiency|
- The bzip2 command is a relative newcomer, which tends to produce the most compact compressed files, but is the most CPU intensive. Files compressed with bzip2 are uncompressed with bunzip2. The bzip2 command supports the following command line switches.
|Switch||Effect|
|-c||Redirect output to stdout|
|-d||Decompress instead of compress file|
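The switches above can be tried out with a short sketch like the following. The file name sample.txt and its repetitive contents are invented for the demonstration, and the bzip2 part is skipped when that utility is not installed.

```shell
cd "$(mktemp -d)"

# Generate a repetitive text file (made-up sample data); such text compresses well.
i=0
while [ "$i" -lt 2000 ]; do
    echo 'the quick brown fox jumps over the lazy dog'
    i=$((i + 1))
done > sample.txt
orig_size=$(wc -c < sample.txt)

# -9 favours compression ratio over speed; -c writes to stdout, leaving sample.txt in place.
gzip -9 -c sample.txt > sample.txt.gz
gz_size=$(wc -c < sample.txt.gz)

# -d decompresses (like gunzip); -c again sends the result to stdout.
gzip -d -c sample.txt.gz > roundtrip.txt
cmp sample.txt roundtrip.txt && echo "gzip round trip OK"

# bzip2 accepts the same -c and -d switches.
if command -v bzip2 >/dev/null 2>&1; then
    bzip2 -c sample.txt > sample.txt.bz2
    bzip2 -d -c sample.txt.bz2 | cmp - sample.txt && echo "bzip2 round trip OK"
fi

echo "original: $orig_size bytes, gzip -9: $gz_size bytes"
```

Without -c, both gzip and bzip2 replace the input file with its compressed version, which is why the sketch redirects stdout instead.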
Practical examples :
|tar cvf archive_name.tar dirname/||Creating an uncompressed tar archive using option cvf|
|tar cvzf archive_name.tar.gz dirname/||Creating a gzipped tar archive using option cvzf. Note: .tgz is the same as .tar.gz|
|tar cvjf archive_name.tar.bz2 dirname/||Creating a bzipped tar archive using option cvjf|
|tar xvf archive_file.tar path/to/file||Extract a single file from a tar, tar.gz or tar.bz2 file. Use the relevant option z or j according to the compression method, gzip or bzip2 respectively. Give the path exactly as tar -t lists it, normally without a leading /.|
Note: You cannot add a file or directory to a compressed archive. If you try to do so, you will get the error “tar: Cannot update compressed archives”.
|tar xvf archive_file.tar --wildcards '*.pl'||Extract a group of files from tar, tar.gz, tar.bz2 archives using a wildcard pattern|
|tar -cf - /directory/to/archive/ | wc -c||Estimate the tar archive size in bytes|
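Several of the commands in the table can be exercised together in a sketch like the one below. The project tree, its file names, and the target directories single and perl are all hypothetical, and --wildcards is a GNU tar option, so GNU tar is assumed.

```shell
cd "$(mktemp -d)"

# Hypothetical project tree used only for this demonstration.
mkdir -p project/scripts project/docs
echo '# hello' > project/scripts/hello.pl
echo '# bye'   > project/scripts/bye.pl
echo 'notes'   > project/docs/notes.txt

# Create a gzip-compressed archive (c = create, z = gzip, f = archive file).
tar czf project.tar.gz project/

# Extract one member into ./single; the name must match what "tar tzf" lists.
mkdir single
tar xzf project.tar.gz -C single project/docs/notes.txt

# Extract only the Perl scripts into ./perl using a shell-style wildcard (GNU tar).
mkdir perl
tar xzf project.tar.gz -C perl --wildcards '*.pl'

# Estimate the uncompressed archive size in bytes without writing it to disk.
tar -cf - project/ | wc -c
```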
More reading :
- info tar from your command line
- The on-line GNU tar manual