Archiving / Compressing Files: tar, gzip and bzip2 on Linux
December 3, 2010

Posted by Tournas Dimitrios in Linux.

This article outlines the basics of archiving and compressing files and directories on Linux, with a handful of examples to get you started with this very important concept in the Linux “world”.

Often, if a directory and its underlying files are not going to be used for a while, or if the entire directory tree is going to be transferred from one place to another, people convert the directory tree into an archive file. The archive contains the directory and its underlying files and subdirectories, packaged as a single file. In Linux (and Unix), the most common command for creating and extracting archives is the tar command.

Originally, archive files provided a solution to backing up disks to tape. When backing up a filesystem, the entire directory structure would be converted into a single file, which was written directly to a tape drive. The tar command derived its name from “t”ape “ar”chive.

Today, tar is seldom used to write to tapes directly, but instead creates archive files which are often referred to as “tar files”, “tar archives”, or sometimes informally as “tarballs”. These archive files are conventionally given the .tar filename extension.
Archiving Files with tar:
The tar command can take a wide variety of arguments which specify the action it will perform on a particular set of files or on the archive. The main types of arguments to tar are:

Switch           Effect
-c, --create     Create an archive file
-x, --extract    Extract an archive file
-t, --list       List the contents of an archive file

There are others, but almost always one of these three will suffice. See the tar(1) man page for more details.

Next, almost every invocation of the tar command must include the -f command line switch and its argument, which specifies which archive file is being created, extracted, or listed.

As an example, the root user has been working on a report, which involves several subdirectories and files. He would like to email a copy of the report to a friend. Rather than attach each individual file to an email message, he decides to create an archive of the report directory. He uses the tar command, specifying -c to “c”reate an archive, and using the -f command line switch to specify the archive file to create.
report/
|-- html/
|   |-- chap1.html
|   |-- chap2.html
|   `-- figures/
|       `-- image1.png
`-- text/
    |-- chap1.txt
    `-- chap2.txt
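
The command itself is not reproduced above. A minimal sketch of what it might look like, assuming the report directory has the layout shown; the scratch directory and the empty placeholder files are assumptions made only so the example runs on its own:

```shell
# Build a throwaway copy of the report tree (hypothetical placeholder files).
workdir=$(mktemp -d)
cd "$workdir"
mkdir -p report/html/figures report/text
touch report/html/chap1.html report/html/chap2.html
touch report/html/figures/image1.png
touch report/text/chap1.txt report/text/chap2.txt

# -c creates an archive, -f names the archive file to write.
tar -c -f report.tar report

# The whole directory tree is now packaged as one ordinary file.
ls -l report.tar
```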

The newly created archive file report.tar now contains the entire contents of the report directory and its subdirectories. In order to confirm that the archive was created correctly, he lists the contents of the archive file with tar -t (again using -f to specify which archive file).
[root@server ~]# tar -t -f report.tar
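
The listing output is not reproduced here. As a rough sketch, assuming an archive built from a small hypothetical directory, -t prints one member name per line without unpacking anything, and -x unpacks those same members:

```shell
# Hypothetical one-file report, just enough to have something to list.
workdir=$(mktemp -d)
cd "$workdir"
mkdir -p report/text
echo "chapter one" > report/text/chap1.txt
tar -c -f report.tar report

# -t lists member names, one per line, without extracting anything.
listing=$(tar -t -f report.tar)
echo "$listing"

# -x extracts; a separate scratch directory keeps the original untouched.
mkdir unpacked
(cd unpacked && tar -x -f ../report.tar)
cat unpacked/report/text/chap1.txt
```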

Creating archives raises a number of complicated questions, such as the following.

  • When creating archives, how should links be handled? Do I archive the link, or what the link refers to?
  • When extracting archives as root, do I want all of the files to be owned by root, or by the original owner? What if the original owner doesn’t exist on the system I’m unpacking the tar on?
  • What happens if the tape drive I’m archiving to runs out of room in the middle of the archive?

The answers to these, and many other questions as well, can be decided with an overwhelming number of command line switches to the tar command, as tar --help or a quick look at the tar(1) man page will demonstrate. The following table lists some of the more commonly used switches, and their use will be discussed below.

Switch                  Effect
-C, --directory=DIR     Change to directory DIR
-P, --absolute-names    Don't strip the leading / from filenames
-v, --verbose           List files as they are processed
-z, --gzip              Compress or decompress the archive with gzip
-j, --bzip2             Compress or decompress the archive with bzip2
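
The first of the questions above, how links are handled, can be explored directly. This is a sketch assuming GNU tar, where the default is to store a symbolic link as a link and -h (--dereference) stores the file it points to instead. The file names are hypothetical.

```shell
workdir=$(mktemp -d)
cd "$workdir"
mkdir docs
echo "real content" > docs/original.txt
ln -s original.txt docs/shortcut.txt

# Default: the symbolic link itself is stored in the archive.
tar -c -f links.tar docs
tar -t -v -f links.tar

# With -h (--dereference): the link is followed and its target's content is stored.
tar -c -h -f deref.tar docs
tar -t -v -f deref.tar
```

In the first verbose listing the shortcut.txt entry shows an arrow to its target; in the second it appears as an ordinary file.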

Absolute References
Suppose the root user wanted to archive a snapshot of the current networking configuration of his machine. He might run a command like the following. (Note the inclusion of the -v command line switch, which lists each file as it is processed.) 
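
The command and its output are not reproduced in this copy. A sketch of the same idea, assuming GNU tar and using a hypothetical scratch directory as a stand-in for /etc/sysconfig/networking so that it runs on any machine:

```shell
# Hypothetical stand-in for /etc/sysconfig/networking.
srcdir=$(mktemp -d)
mkdir -p "$srcdir/networking/profiles"
echo "HOSTNAME=server" > "$srcdir/networking/profiles/network"

outdir=$(mktemp -d)
# "$srcdir" is an absolute path; -v lists each file as it is processed.
tar -c -v -f "$outdir/net.tar" "$srcdir/networking" 2> "$outdir/messages.txt"

# tar reports on stderr that it is removing the leading "/" from member names.
cat "$outdir/messages.txt"

# Member names inside the archive are relative.
tar -t -f "$outdir/net.tar"
```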

As the leading message implies, what was an absolute reference to /etc/sysconfig/networking is converted to relative references inside the archive: none of the entries have leading slashes. As a rule, archives therefore unpack relative to the current directory, reducing the chance that you will unintentionally clobber files in your filesystem by unpacking an archive on top of them. When constructing the archive, this behavior can be overridden with the -P command line switch.
Establishing Context
When extracting the archive above, the first “interesting” directory is the networking directory, because it contains the relevant subdirectories and files. When the archive is extracted, however, the “extra” directories etc and etc/sysconfig are created as well. In order to get to the interesting directory, someone has to work their way down to it.

When constructing an archive, the -C command line switch can be used to help establish context by changing directory before the archive is constructed. Compare the following two tar commands.
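
The two commands are missing from this copy. A sketch of the comparison, assuming a hypothetical scratch tree that mimics etc/sysconfig/networking:

```shell
# Hypothetical tree mimicking etc/sysconfig/networking.
srcdir=$(mktemp -d)
mkdir -p "$srcdir/etc/sysconfig/networking/profiles"
echo "HOSTNAME=server" > "$srcdir/etc/sysconfig/networking/profiles/network"
cd "$srcdir"

# Without -C: every member carries the etc/sysconfig/ prefix.
tar -c -f with_prefix.tar etc/sysconfig/networking
tar -t -f with_prefix.tar

# With -C: tar changes into etc/sysconfig first, so members start at networking/.
tar -c -f no_prefix.tar -C etc/sysconfig networking
tar -t -f no_prefix.tar
```

The archive named by -f is still written to the directory tar was started in; -C only affects the file arguments that follow it.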
Compressing Files with gzip and bzip2:

Files that are rarely used are often compressed. Large files are also compressed before being transferred to other systems or users. The advantages of saved space and bandwidth usually outweigh the added time it takes to compress and uncompress files.

Text files often have patterns that can be compressed up to 75% but binary files rarely compress more than 25%. In fact, it is even possible for a compressed binary file to be larger than the original file!
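
The difference is easy to observe. The sketch below uses two hypothetical test files: highly repetitive text, and random bytes as an extreme stand-in for data with no patterns. Real text and real binaries fall somewhere between these two extremes.

```shell
workdir=$(mktemp -d)
cd "$workdir"

# Highly repetitive text: the same line 5000 times.
awk 'BEGIN { for (i = 0; i < 5000; i++) print "the quick brown fox jumps over the lazy dog" }' > text.txt
# Random bytes: nothing for the compressor to exploit.
head -c 100000 /dev/urandom > random.bin

# -c writes the compressed data to stdout, leaving the originals untouched.
gzip -c text.txt > text.txt.gz
gzip -c random.bin > random.bin.gz

# Compare sizes in bytes.
wc -c text.txt text.txt.gz random.bin random.bin.gz
```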

The two most common compression utilities used in Linux are:

  • The gzip command is the most versatile and most commonly used compression utility. Files compressed with gzip are uncompressed with gunzip. Additionally, the gzip command supports the following command line switches.
    Switch      Effect
    -c          Write output to stdout, leaving the original file unchanged
    -d          Decompress instead of compress
    -r          Recurse through subdirectories, compressing individual files
    -1 … -9     Trade speed against compression: -1 is fastest, -9 compresses best
  • The bzip2 command is a relative newcomer, which tends to produce the most compact compressed files, but is the most CPU intensive. Files compressed with bzip2 are uncompressed with bunzip2. The bzip2 command supports the following command line switches.
    Switch      Effect
    -c          Write output to stdout, leaving the original file unchanged
    -d          Decompress instead of compress
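
A short sketch of these switches in use, assuming both gzip and bzip2 are installed. The sample file is hypothetical.

```shell
workdir=$(mktemp -d)
cd "$workdir"
awk 'BEGIN { for (i = 0; i < 2000; i++) print "line", i, "of a sample log" }' > sample.log

# gzip replaces sample.log with sample.log.gz; gunzip (or gzip -d) reverses it.
gzip sample.log
gunzip sample.log.gz

# -c leaves the original in place and writes the compressed data to stdout.
gzip -1 -c sample.log > fast.gz      # fastest, least compression
gzip -9 -c sample.log > small.gz     # slowest, most compression
bzip2 -c sample.log > sample.log.bz2

# Decompress to stdout and compare against the original.
bzip2 -d -c sample.log.bz2 | cmp - sample.log && echo "round trip ok"
wc -c sample.log fast.gz small.gz sample.log.bz2
```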

Practical examples:

Create an uncompressed tar archive using the options cvf:
  • tar cvf archive_name.tar dirname/

Create a gzipped tar archive using the options cvzf. Plain cvf does not compress anything; the z option adds gzip compression:
  • tar cvzf archive_name.tar.gz dirname/

Note: .tgz is the same as .tar.gz
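
One way to think about the z option is as a shorthand for piping tar's output through gzip. A sketch under that assumption, with hypothetical file names:

```shell
workdir=$(mktemp -d)
cd "$workdir"
mkdir dirname
echo "hello" > dirname/a.txt
echo "world" > dirname/b.txt

# One step: tar compresses the archive itself.
tar czf one_step.tar.gz dirname/

# Two steps: write the archive to stdout (-f -) and compress it with gzip.
tar cf - dirname/ | gzip > two_step.tgz

# Both hold the same members; .tgz is only a shorter spelling of .tar.gz.
tar tzf one_step.tar.gz
tar tzf two_step.tgz
```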

Create a bzipped tar archive using the options cvjf:
  • tar cvjf archive_name.tar.bz2 dirname/

Extract a tar file using the option x:
  • tar xvf archive_name.tar

Extract a gzipped tar archive (*.tar.gz) using the options xvzf:
  • tar xvzf archive_name.tar.gz

Extract a bzipped tar archive (*.tar.bz2) using the options xvjf:
  • tar xvjf archive_name.tar.bz2

View the contents of a *.tar, *.tar.gz or *.tar.bz2 file before extracting it by using the option t in place of x:
  • tar tvf archive_name.tar
  • tar tvzf archive_name.tar.gz
  • tar tvjf archive_name.tar.bz2

Extract a single file from a tar, tar.gz or tar.bz2 archive by naming it after the archive. Give the path as it appears in the listing from tar tvf (normally without a leading /), and use the option z or j according to the compression method, gzip or bzip2 respectively:
  • tar xvf archive_file.tar path/to/file
  • tar xvzf archive_file.tar.gz path/to/file
  • tar xvjf archive_file.tar.bz2 path/to/file
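
A sketch of single-file extraction, assuming a hypothetical project directory. Listing the archive first shows the exact member name to pass to tar.

```shell
workdir=$(mktemp -d)
cd "$workdir"
mkdir -p project/src project/docs
echo "print 1;" > project/src/main.pl
echo "notes" > project/docs/readme.txt
tar czf archive_file.tar.gz project

# Find the exact member name first.
tar tzf archive_file.tar.gz

# Extract only that member, into a scratch directory chosen with -C.
mkdir out
tar xzf archive_file.tar.gz -C out project/docs/readme.txt
ls -R out
```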

Add a file or a directory to an existing archive using the option r:
  • tar rvf archive_name.tar newfile
  • tar rvf archive_name.tar newdir/

Note: You cannot add a file or directory to a compressed archive. If you try to do so, you will get the error “tar: Cannot update compressed archives”.
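
A common workaround, sketched here with hypothetical file names, is to decompress the archive, append to it, and compress it again:

```shell
workdir=$(mktemp -d)
cd "$workdir"
echo "first" > first.txt
echo "second" > newfile
tar cf archive_name.tar first.txt

# r appends to an uncompressed archive.
tar rf archive_name.tar newfile
tar tf archive_name.tar

# Compress the archive, as it might arrive from elsewhere.
gzip archive_name.tar                # now archive_name.tar.gz

# To append to it: decompress, append, recompress.
echo "third" > third.txt
gunzip archive_name.tar.gz           # back to archive_name.tar
tar rf archive_name.tar third.txt
gzip archive_name.tar
tar tzf archive_name.tar.gz
```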

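Extract a group of files from a tar, tar.gz or tar.bz2 archive using a wildcard pattern (a shell-style glob, not a regular expression):
  • tar xvf archive_file.tar --wildcards '*.pl'

Estimate the size of a tar archive in bytes without writing it to disk:
  • tar -cf - /directory/to/archive/ | wc -c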
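
Both commands can be tried on a small hypothetical directory. This sketch assumes GNU tar, whose --wildcards option applies shell-style patterns to member names:

```shell
workdir=$(mktemp -d)
cd "$workdir"
mkdir -p scripts/lib
echo 'print "a\n";' > scripts/run.pl
echo 'print "b\n";' > scripts/lib/util.pl
echo "notes" > scripts/notes.txt

# Estimate the archive size in bytes without writing an archive to disk.
estimate=$(tar -cf - scripts | wc -c)
echo "estimated bytes: $estimate"

tar -cf archive_file.tar scripts
mkdir out
# Quote the pattern so the shell passes it to tar unexpanded.
tar -xf archive_file.tar -C out --wildcards '*.pl'
find out -type f
```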
