Archiving, compressing, and decompressing files are common tasks, usually handled with tools like tar and gzip. In Go, the standard library packages under archive and compress provide the same capabilities, and the examples in this article show how easy it is to create and read compressed archives in idiomatic Go.

Archiving and Compression

Before we get to the code, we need to clarify the concepts of archiving and compression.

  • Archiving means collecting multiple files or directories into a single file.
  • Compression means using an algorithm to make a file smaller while preserving as much of its information as possible (all of it, in the case of a lossless format such as gzip).

Take the archiving tool tar as an example: the files it produces are usually called tarballs, and their names usually end in .tar. A tarball is then typically compressed with another tool, such as gzip, producing a file that usually ends in .tar.gz (tar's -z option invokes gzip for you).

A tarball is structured as a sequence of data segments, one per file, each consisting of a header (metadata describing the file) followed by the contents of the file.

+----------------------------------------+
| Header                                 |
| [name][mode][owner][group][size]  ...  |
+----------------------------------------+
| Content                                |
| XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX|
| XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX|
+----------------------------------------+
| Header                                 |
| [name][mode][owner][group][size]  ...  |
+----------------------------------------+
| Content                                |
| XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX|
| XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX|
+----------------------------------------+
| ...                                    |
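The layout in the diagram can be observed from Go by building a small archive in memory and measuring it. The sketch below relies on a detail the article does not state: the tar format works in 512-byte blocks, so each header occupies one block, each file's content is padded to a whole number of blocks, and the archive ends with two empty blocks. The names entry and buildTar are hypothetical helpers introduced only for this illustration.

```go
package main

import (
	"archive/tar"
	"bytes"
	"fmt"
	"log"
)

// entry is a hypothetical in-memory file used for illustration.
type entry struct{ Name, Body string }

// buildTar writes the entries into an in-memory tar archive and returns
// the raw bytes of that archive.
func buildTar(entries []entry) ([]byte, error) {
	var buf bytes.Buffer
	tw := tar.NewWriter(&buf)
	for _, e := range entries {
		hdr := &tar.Header{Name: e.Name, Mode: 0600, Size: int64(len(e.Body))}
		if err := tw.WriteHeader(hdr); err != nil {
			return nil, err
		}
		if _, err := tw.Write([]byte(e.Body)); err != nil {
			return nil, err
		}
	}
	// Close writes the end-of-archive marker (two empty 512-byte blocks).
	if err := tw.Close(); err != nil {
		return nil, err
	}
	return buf.Bytes(), nil
}

func main() {
	data, err := buildTar([]entry{
		{"readme.txt", "This archive contains some text files."},
		{"gopher.txt", "Gopher names:\nGeorge\nGeoffrey\nGonzo"},
		{"todo.txt", "Get animal handling license."},
	})
	if err != nil {
		log.Fatal(err)
	}
	// Three headers + three padded contents + two trailer blocks = 8 blocks.
	fmt.Printf("3 files -> %d bytes (%d blocks of 512)\n", len(data), len(data)/512)
	// prints: 3 files -> 4096 bytes (8 blocks of 512)
}
```

Only about a hundred of those bytes are actual file content; the rest is header fields and zero padding.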

Archiving and unarchiving with the archive library

The archive library handles archiving and unarchiving. It supports two formats, tar and zip, whose import paths are archive/tar and archive/zip respectively.

Let’s take tar as an example to show how to archive and unarchive files.

First, create the target archive file out.tar, then define some in-memory file data (readme.txt, gopher.txt, and todo.txt) to archive.

import (
 "archive/tar"
 ...
)

func main() {
 // Create the archive file and wrap it in a tar writer.
 tarPath := "out.tar"
 tarFile, err := os.Create(tarPath)
 if err != nil {
  log.Fatal(err)
 }
 defer tarFile.Close()
 tw := tar.NewWriter(tarFile)
 // Deferred calls run last-in, first-out, so tw is closed
 // (finishing the archive) before tarFile is closed.
 defer tw.Close()
 // File data to add to the archive.
 var files = []struct {
  Name, Body string
 }{
  {"readme.txt", "This archive contains some text files."},
  {"gopher.txt", "Gopher names:\nGeorge\nGeoffrey\nGonzo"},
  {"todo.txt", "Get animal handling license."},
 }
 ...
}

Next, build a header for each file in turn, specifying its name, permissions, and size (other header fields can be set as well). Then call WriteHeader followed by Write on tw, a variable of type *tar.Writer, to write each data segment (header plus file content) to out.tar.

 ...
 for _, file := range files {
  hdr := &tar.Header{
   Name: file.Name,
   Mode: 0600,
   Size: int64(len(file.Body)),
  }
  if err := tw.WriteHeader(hdr); err != nil {
   log.Fatal(err)
  }
  if _, err := tw.Write([]byte(file.Body)); err != nil {
   log.Fatal(err)
  }
 }
}

Running the code above produces the archive out.tar, whose contents can be listed by passing the -tvf options to the tar utility.

$ tar -tvf out.tar
-rw-------  0 0      0          38 Jan  1  1970 readme.txt
-rw-------  0 0      0          35 Jan  1  1970 gopher.txt
-rw-------  0 0      0          28 Jan  1  1970 todo.txt

As you can see, the fields we specified (file name, permissions, and size) are as expected, but metadata we left unset is not meaningful: the date, for example, simply shows a default value (Jan 1 1970).

If we use the tar utility, we can execute the following command to extract the files in out.tar.

$ tar -xvf out.tar
x readme.txt
x gopher.txt
x todo.txt

But how do we do the same thing from a Go program?

func main() {
 tarPath := "out.tar"
 tarFile, err := os.Open(tarPath)
 if err != nil {
  log.Fatal(err)
 }
 defer tarFile.Close()
 tr := tar.NewReader(tarFile)
 for {
  hdr, err := tr.Next()
  // End of archive
  if err == io.EOF {
   break
  }
  if err != nil {
   log.Fatal(err)
  }
  fmt.Printf("Contents of %s: ", hdr.Name)
  if _, err := io.Copy(os.Stdout, tr); err != nil {
   log.Fatal(err)
  }
  fmt.Println()
 }
}

// Output:
Contents of readme.txt: This archive contains some text files.
Contents of gopher.txt: Gopher names:
George
Geoffrey
Gonzo
Contents of todo.txt: Get animal handling license.

First, open out.tar and construct tr, a variable of type *tar.Reader. Then call tr.Next repeatedly to advance to each data segment in turn, copying the contents of each file to standard output with io.Copy. When tr.Next returns io.EOF, the end of the archive has been reached and the loop exits.

Compression and decompression with the compress library

The compress library supports several compression formats: bzip2 (decompression only), flate, gzip, lzw, and zlib. Each is imported as compress/xxx, for example compress/gzip.

Let’s take the commonly used gzip as an example to show the compression and decompression code.

Using the same file data as above (readme.txt, gopher.txt, and todo.txt), how do we produce a tar archive that is also compressed, out.tar.gz?

package main

import (
 "archive/tar"
 "compress/gzip"
 ...
)

func main() {
 tarPath := "out.tar.gz"
 tarFile, err := os.Create(tarPath)
 if err != nil {
  log.Fatal(err)
 }
 defer tarFile.Close()
 gz := gzip.NewWriter(tarFile)
 defer gz.Close()
 tw := tar.NewWriter(gz)
 defer tw.Close()
 ...
}

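Very simple! Just change tar.NewWriter(tarFile) to tar.NewWriter(gz), where gz comes from gzip.NewWriter(tarFile). Because deferred calls run in reverse order, tw is closed first, then gz, then tarFile, so each layer flushes its remaining data into the layer beneath it.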

Comparing the tarball with and without compression, we can see that the file shrinks from 4.0K to 224B. The ratio is this dramatic largely because such a small tarball consists mostly of zero padding, which compresses very well.

$ ls -alh out.tar out.tar.gz
-rw-r--r--  1 slp  staff   4.0K Jul  3 21:52 out.tar
-rw-r--r--  1 slp  staff   224B Jul  3 21:53 out.tar.gz

Similarly, how do we decompress and unarchive the out.tar.gz file?

package main

import (
 "archive/tar"
 "compress/gzip"
 ...
)

func main() {
 tarPath := "out.tar.gz"
 tarFile, err := os.Open(tarPath)
 if err != nil {
  log.Fatal(err)
 }
 defer tarFile.Close()
 gz, err := gzip.NewReader(tarFile)
 if err != nil {
  log.Fatal(err)
 }
 defer gz.Close()
 tr := tar.NewReader(gz)
 ...
}

It’s still very simple! Just change tar.NewReader(tarFile) to tar.NewReader(gz), where gz is derived from gzip.NewReader(tarFile).
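To see both directions working together without touching the file system, the whole pipeline can be run against a bytes.Buffer. The following is a sketch rather than the article's original program: the type file and the functions packTarGz and unpackTarGz are hypothetical names. It closes the writers explicitly, tar first and then gzip, which is the same order the deferred calls produce in the examples above.

```go
package main

import (
	"archive/tar"
	"bytes"
	"compress/gzip"
	"fmt"
	"io"
	"log"
)

// file is a hypothetical in-memory file used for illustration.
type file struct{ Name, Body string }

// packTarGz archives the files with tar and compresses the result with gzip.
func packTarGz(files []file) ([]byte, error) {
	var buf bytes.Buffer
	gz := gzip.NewWriter(&buf)
	tw := tar.NewWriter(gz)
	for _, f := range files {
		hdr := &tar.Header{Name: f.Name, Mode: 0600, Size: int64(len(f.Body))}
		if err := tw.WriteHeader(hdr); err != nil {
			return nil, err
		}
		if _, err := tw.Write([]byte(f.Body)); err != nil {
			return nil, err
		}
	}
	// Close the inner layer first so its trailer passes through gzip,
	// then close gzip so the compressed stream is finished.
	if err := tw.Close(); err != nil {
		return nil, err
	}
	if err := gz.Close(); err != nil {
		return nil, err
	}
	return buf.Bytes(), nil
}

// unpackTarGz decompresses data and returns every file in the tar archive.
func unpackTarGz(data []byte) ([]file, error) {
	gz, err := gzip.NewReader(bytes.NewReader(data))
	if err != nil {
		return nil, err
	}
	defer gz.Close()
	tr := tar.NewReader(gz)
	var files []file
	for {
		hdr, err := tr.Next()
		if err == io.EOF {
			return files, nil
		}
		if err != nil {
			return nil, err
		}
		body, err := io.ReadAll(tr)
		if err != nil {
			return nil, err
		}
		files = append(files, file{hdr.Name, string(body)})
	}
}

func main() {
	packed, err := packTarGz([]file{
		{"readme.txt", "This archive contains some text files."},
		{"todo.txt", "Get animal handling license."},
	})
	if err != nil {
		log.Fatal(err)
	}
	files, err := unpackTarGz(packed)
	if err != nil {
		log.Fatal(err)
	}
	for _, f := range files {
		fmt.Printf("%s: %d bytes\n", f.Name, len(f.Body))
	}
	// prints:
	// readme.txt: 38 bytes
	// todo.txt: 28 bytes
}
```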

Summary

This article showed how to archive and unarchive files with the archive/tar package, and how to additionally compress and decompress a tarball with the compress/gzip package.

To use compress/gzip, we wrapped one more Writer/Reader layer around the file, which added compression and decompression to the tar archive. Even better, switching to a different archiving or compression scheme only requires replacing the corresponding Writer/Reader. This convenience comes from Go’s excellent stream-based I/O design.
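As a rough illustration of that swap, the sketch below selects the compression layer by name while the tar code stays the same, because it only ever sees an io.Writer or io.Reader. The functions newCompressor, newDecompressor, and roundTrip are hypothetical helpers, and only gzip and zlib are offered here (the standard library's compress/bzip2 can decompress but not compress).

```go
package main

import (
	"archive/tar"
	"bytes"
	"compress/gzip"
	"compress/zlib"
	"fmt"
	"io"
	"log"
)

// newCompressor returns a compressing writer for the named format.
func newCompressor(format string, w io.Writer) (io.WriteCloser, error) {
	switch format {
	case "gzip":
		return gzip.NewWriter(w), nil
	case "zlib":
		return zlib.NewWriter(w), nil
	}
	return nil, fmt.Errorf("unsupported format %q", format)
}

// newDecompressor returns a decompressing reader for the named format.
func newDecompressor(format string, r io.Reader) (io.ReadCloser, error) {
	switch format {
	case "gzip":
		zr, err := gzip.NewReader(r)
		if err != nil {
			return nil, err
		}
		return zr, nil
	case "zlib":
		return zlib.NewReader(r)
	}
	return nil, fmt.Errorf("unsupported format %q", format)
}

// roundTrip archives one file, compresses it with the chosen format, then
// reverses both steps and returns the recovered file body.
func roundTrip(format, name, body string) (string, error) {
	var buf bytes.Buffer
	cw, err := newCompressor(format, &buf)
	if err != nil {
		return "", err
	}
	tw := tar.NewWriter(cw) // the tar layer is identical for every format
	hdr := &tar.Header{Name: name, Mode: 0600, Size: int64(len(body))}
	if err := tw.WriteHeader(hdr); err != nil {
		return "", err
	}
	if _, err := tw.Write([]byte(body)); err != nil {
		return "", err
	}
	if err := tw.Close(); err != nil {
		return "", err
	}
	if err := cw.Close(); err != nil {
		return "", err
	}

	cr, err := newDecompressor(format, &buf)
	if err != nil {
		return "", err
	}
	defer cr.Close()
	tr := tar.NewReader(cr)
	if _, err := tr.Next(); err != nil {
		return "", err
	}
	out, err := io.ReadAll(tr)
	return string(out), err
}

func main() {
	for _, format := range []string{"gzip", "zlib"} {
		body, err := roundTrip(format, "todo.txt", "Get animal handling license.")
		if err != nil {
			log.Fatal(err)
		}
		fmt.Printf("%s: %s\n", format, body)
	}
}
```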

Of course, reading only gets you so far; the best way to learn this is to try it yourself. If you haven’t used the archive and compress libraries before, try handling compressed archives with a scheme not covered in this article.
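As a possible starting point, here is a minimal sketch using archive/zip, the other format in the archive library. Unlike tar, a zip file handles archiving and compression in one step, so no separate compress layer is needed. The type file and the functions packZip and unpackZip are hypothetical names chosen for this illustration.

```go
package main

import (
	"archive/zip"
	"bytes"
	"fmt"
	"io"
	"log"
)

// file is a hypothetical in-memory file used for illustration.
type file struct{ Name, Body string }

// packZip writes the files into an in-memory zip archive.
func packZip(files []file) ([]byte, error) {
	var buf bytes.Buffer
	zw := zip.NewWriter(&buf)
	for _, f := range files {
		// Create adds an entry and returns a writer for its content.
		w, err := zw.Create(f.Name)
		if err != nil {
			return nil, err
		}
		if _, err := w.Write([]byte(f.Body)); err != nil {
			return nil, err
		}
	}
	// Close writes the central directory that finishes the archive.
	if err := zw.Close(); err != nil {
		return nil, err
	}
	return buf.Bytes(), nil
}

// unpackZip reads every entry back out of a zip archive held in memory.
func unpackZip(data []byte) ([]file, error) {
	// zip needs random access, so it takes an io.ReaderAt plus the size.
	zr, err := zip.NewReader(bytes.NewReader(data), int64(len(data)))
	if err != nil {
		return nil, err
	}
	var files []file
	for _, zf := range zr.File {
		rc, err := zf.Open()
		if err != nil {
			return nil, err
		}
		body, err := io.ReadAll(rc)
		rc.Close()
		if err != nil {
			return nil, err
		}
		files = append(files, file{zf.Name, string(body)})
	}
	return files, nil
}

func main() {
	packed, err := packZip([]file{
		{"readme.txt", "This archive contains some text files."},
		{"todo.txt", "Get animal handling license."},
	})
	if err != nil {
		log.Fatal(err)
	}
	files, err := unpackZip(packed)
	if err != nil {
		log.Fatal(err)
	}
	for _, f := range files {
		fmt.Printf("Contents of %s: %s\n", f.Name, f.Body)
	}
}
```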