192

Context

I'm compressing ~1.3 GB folders, each containing 1440 JSON files, and I find a 15-fold difference in archive size between using the tar command on macOS or Raspbian 10 (Buster) and using Python's built-in tarfile library.

Minimal working example

This script compares both methods:

#!/usr/bin/env python3

from pathlib import Path
from subprocess import call
import tarfile

fullpath = Path("/Users/user/Desktop/temp/tar/2021-03-11")
zsh_out = Path(fullpath.parent, "zsh-archive.tar.xz")
py_out = Path(fullpath.parent, "py-archive.tar.xz")

# tar using terminal
# tar cJf zsh-archive.tar.xz folderpath
call(["tar", "cJf", zsh_out, fullpath])

# tar using tarfile library
with tarfile.open(py_out, "w:xz") as tar:
    tar.add(fullpath, arcname=fullpath.stem)

# Print filesizes
print(f"zsh tar filesize: {round(Path(zsh_out).stat().st_size/(1024*1024), 2)} MB")
print(f"py tar filesize: {round(Path(py_out).stat().st_size/(1024*1024), 2)} MB")

The output is:

zsh tar filesize: 23.7 MB
py tar filesize: 1.49 MB

The versions I use are as follows:

  • tar on macOS: bsdtar 3.3.2 - libarchive 3.3.2 zlib/1.2.11 liblzma/5.0.5 bz2lib/1.0.6
  • tar on Raspbian 10: xz (XZ Utils) 5.2.4 liblzma 5.2.4
  • tarfile Python library: 0.9.0

Things I've tried

After compression, I've extracted both archives and compared the resulting folder with:

diff -r py-archive-expanded zsh-archive-expanded

There was no difference.

If I compare the two tar archives directly, they seem different:

➜ diff zsh-archive.tar.xz py-archive.tar.xz
Binary files zsh-archive.tar.xz and py-archive.tar.xz differ

If I inspect the archives with Quicklook (and the Betterzip plugin) I see that the files in the archive are ordered in a different way:

[Two screenshots of the archive listings: zsh-archive.tar.xz on the left, py-archive.tar.xz on the right]

The zsh archive uses an unknown order, and the Python archive orders the files by modification date. I am not sure if that matters.

Question

What is going on? Am I losing something by using the Python library to compress my data? Is the 15-fold difference in size an indicator of some issue? Or can I safely go ahead and use the efficient Python implementation?

10
  • 2
    Did you make sure the result of tar cJf is actually xz-compressed? xz also uses LZMA but it is a distinct format from, say, 7-zip. Try file the-archive.tar.xz. – Daniel B Mar 13 at 19:20
  • file zsh-archive.tar.xz gives zsh-archive.tar.xz: XZ compressed data – Saaru Lindestøkke Mar 13 at 20:23
  • 2
    Did you actually tar up the same directory tree in both cases? Just making sure ;-) – tink Mar 13 at 20:39
  • 3
    Hm, okay. Please verify whether the uncompressed .tar files are the same. Files may have been added in a different order, which creates a different compression result. – Daniel B Mar 13 at 21:17
  • 1
    @tink, yes I do. I've added a testscript in my question that shows the same directory being compressed generating the wildly different filesize. – Saaru Lindestøkke Mar 13 at 22:34
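One way to act on the suggestion above (checking whether the two archives hold the same members in a different order) is to list the member names with tarfile. This is only a sketch: build_tar and member_names are hypothetical helper names, and the two tiny in-memory archives stand in for zsh-archive.tar.xz and py-archive.tar.xz.

```python
import io
import tarfile


def build_tar(names):
    """Build a small in-memory .tar.xz whose members appear in the given order."""
    buf = io.BytesIO()
    with tarfile.open(fileobj=buf, mode="w:xz") as tar:
        for name in names:
            data = b"{}"
            info = tarfile.TarInfo(name)
            info.size = len(data)
            tar.addfile(info, io.BytesIO(data))
    buf.seek(0)
    return buf


def member_names(fileobj):
    """Return member names in the order they are stored in the archive."""
    with tarfile.open(fileobj=fileobj, mode="r:*") as tar:
        return tar.getnames()


# Stand-ins for the two real archives: same members, different order.
unsorted_names = member_names(build_tar(["b.json", "a.json", "c.json"]))
sorted_names = member_names(build_tar(["a.json", "b.json", "c.json"]))

same_members = sorted(unsorted_names) == sorted(sorted_names)
same_order = unsorted_names == sorted_names
print(f"same members: {same_members}, same order: {same_order}")
```

For the real archives, passing open("zsh-archive.tar.xz", "rb") and open("py-archive.tar.xz", "rb") to member_names should give the two orderings to compare.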
249

Short answer: yes, it is safe to use Python's tarfile library to compress the data; nothing is lost compared to BSD tar.

Underlying issue: sorting

I think the underlying issue is that BSD tar and GNU tar without any sort options put the files in the archive in an undefined order.

GNU tar has a --sort option:

sort directory entries according to ORDER, which is one of none, name, or inode.
The default is --sort=none, which stores archive members in the same order as returned by the operating system.

Testing GNU tar

To test this I installed GNU tar on my Mac with:

brew install gnu-tar

And then tarred the same folder, but with the --sort option:

gtar --sort='name' -cJf zsh-archive-sorted.tar.xz /Users/user/Desktop/temp/tar/2021-03-11

The zsh-archive-sorted.tar.xz archive is 1.5 MB, equal to the size of the archive created by the Python library.

Concatenating in sorted order

The effect of sorting on the final archive size is further demonstrated by first concatenating all the JSON files in name order (each filename begins with the creation Unix time, and the shell expands *.json in sorted order) and then tarring the result with BSD tar:

cat *.json > all.txt
tar cJf zsh-cat-archive.tar.xz all.txt

The zsh-cat-archive.tar.xz archive is also 1.5 MB.

Python tarfile sorting

Finally, the documentation of the Python TarFile.add function confirms that Python tarfile sorts by default:

Directories are added recursively by default. This can be avoided by setting recursive to False. Recursion adds entries in sorted order.
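A quick way to see this behaviour is to create files in a deliberately unsorted order and inspect the resulting archive. This is a minimal sketch that assumes Python 3.7 or later (where the sorted recursion is documented); the folder and file names are made up for illustration.

```python
import io
import tarfile
import tempfile
from pathlib import Path

with tempfile.TemporaryDirectory() as tmp:
    folder = Path(tmp, "2021-03-11")
    folder.mkdir()
    # Create the files out of name order on purpose.
    for name in ["1615420920.json", "1615420800.json", "1615420860.json"]:
        (folder / name).write_text("{}")

    # Write an uncompressed tar to memory; tar.add() recurses into the folder.
    buf = io.BytesIO()
    with tarfile.open(fileobj=buf, mode="w") as tar:
        tar.add(folder, arcname=folder.name)

# Read the archive back and list members in stored order.
buf.seek(0)
with tarfile.open(fileobj=buf, mode="r") as tar:
    names = tar.getnames()

print(names)
```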

Why sorting matters

I think the reason sorting has such an impact in my case is as follows:

My JSON files contain locations of hundreds of vehicles. Every minute I read out all the locations, but only a few of these locations have a different value from minute to minute.
When the files are sorted by name, two consecutive files differ in only a few characters. Apparently this is very favourable for compression efficiency.
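The effect can be reproduced on a small scale with synthetic data. The sketch below is not my real dataset: it assumes 200 made-up vehicles of which roughly 5% move each minute, and it shrinks the LZMA2 dictionary to 32 KiB (an assumption, chosen so that the dictionary holds only a few files, much like a default xz dictionary holds only a few of my ~1 MB files). The helper names make_snapshots and xz_tar_size are hypothetical.

```python
import io
import json
import lzma
import random
import tarfile


def make_snapshots(n_files=120, n_vehicles=200, seed=0):
    """Synthetic per-minute snapshots; about 5% of vehicles move each minute."""
    rng = random.Random(seed)
    positions = [[rng.uniform(-90, 90), rng.uniform(-180, 180)]
                 for _ in range(n_vehicles)]
    snapshots = {}
    for minute in range(n_files):
        for pos in positions:
            if rng.random() < 0.05:
                pos[0] = rng.uniform(-90, 90)
                pos[1] = rng.uniform(-180, 180)
        # File names start with a Unix timestamp, as in the question.
        name = f"{1615420800 + 60 * minute}.json"
        snapshots[name] = json.dumps(positions).encode()
    return snapshots


def xz_tar_size(snapshots, names, dict_size=1 << 15):
    """Tar the snapshots in the given order, then xz-compress the tar stream."""
    buf = io.BytesIO()
    with tarfile.open(fileobj=buf, mode="w") as tar:
        for name in names:
            data = snapshots[name]
            info = tarfile.TarInfo(name=name)
            info.size = len(data)
            tar.addfile(info, io.BytesIO(data))
    # Small dictionary (assumption) so only a few neighbouring files are "visible".
    filters = [{"id": lzma.FILTER_LZMA2, "preset": 6, "dict_size": dict_size}]
    return len(lzma.compress(buf.getvalue(), format=lzma.FORMAT_XZ, filters=filters))


snapshots = make_snapshots()
names_sorted = sorted(snapshots)
names_shuffled = names_sorted[:]
random.Random(1).shuffle(names_shuffled)

sorted_size = xz_tar_size(snapshots, names_sorted)
shuffled_size = xz_tar_size(snapshots, names_shuffled)
print(f"sorted order:   {sorted_size} bytes")
print(f"shuffled order: {shuffled_size} bytes")
```

In sorted order each file sits right after its near-identical predecessor, so the compressor can reference it; in shuffled order the nearby files are from unrelated minutes and share far fewer values.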

14
  • 6
    Compression programs operate on blocks of text controlled by a single dictionary; by sorting the input, you've put similar bits near each other, allowing xz to compress lots of similar data in one dictionary. Compression and decompression was probably also faster. – RonJohn Mar 14 at 4:08
  • 49
    Wow, another case where sorting makes things much faster. – justhalf Mar 14 at 7:41
  • 5
    I don't really understand yet why the OS returns the files in "unsorted" order with the sort=none option. I mean, there's always some sort order, right? If anyone knows what order the OS uses feel free to add. – Saaru Lindestøkke Mar 14 at 9:22
  • 22
    TL:DR: "unsorted" means use dir entries in the order we get them from the OS's system call, which you can see with ls -U. – Peter Cordes Mar 14 at 11:18
  • 13
    Wowow! You know, this makes so much sense on such a basic level I commend you for discovering this. The idea that sorting text files in some way would improve compression seems so damned obvious when stated, but not obvious if one has not had experience with it. Excellent answer! – Giacomo1968 Mar 14 at 22:10
5

Try setting the compression levels in the macOS command line.

I know you are asking about xz, but as explained in this answer here, on older versions of GZip you can set the compression level with an environment variable like this:

GZIP=-9 tar czf zsh-archive.tar.gz folderpath

That said, that only seems to work with GZip 1.8 and is deprecated in later versions. So use the -I/--use-compress-program=COMMAND option for tar instead; note that this option might not work on macOS, but I am including it here just in case. The command would then change to:

tar -I 'gzip -9' -cf zsh-archive.tar.gz folderpath

And yes, these examples compress the archive with gzip instead of xz, but you can easily switch the command to xz like this:

tar -I 'xz -9' -cf zsh-archive.tar.xz folderpath

The xz compression level ranges from -0 to -9 with the default being -6; so -9 is the highest compression level.

Just note that xz is not installed on macOS by default. To install it on macOS you must first install Homebrew and then install xz via Homebrew like this:

brew install xz
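For comparison on the Python side, tarfile.open accepts a preset keyword in "w:xz" mode that plays the same role as the -0 to -9 levels. The sketch below uses a made-up, highly repetitive payload and a hypothetical helper name (xz_tar_bytes); it only illustrates how the level is set, and the sizes it prints say nothing about the 1440-file case.

```python
import io
import tarfile


def xz_tar_bytes(payload, preset):
    """Return a .tar.xz (as bytes) holding one member, compressed at the given preset."""
    buf = io.BytesIO()
    with tarfile.open(fileobj=buf, mode="w:xz", preset=preset) as tar:
        info = tarfile.TarInfo("sample.json")
        info.size = len(payload)
        tar.addfile(info, io.BytesIO(payload))
    return buf.getvalue()


# Made-up payload: one JSON line repeated many times.
payload = b'{"vehicle": 17, "lat": 52.37, "lon": 4.89}\n' * 5000

sizes = {preset: len(xz_tar_bytes(payload, preset)) for preset in (0, 6, 9)}
print(sizes)

# Round-trip check: the member content survives compression at preset 9.
with tarfile.open(fileobj=io.BytesIO(xz_tar_bytes(payload, 9)), mode="r:xz") as tar:
    restored = tar.extractfile("sample.json").read()
```

As far as I know, when preset is omitted Python's lzma module falls back to level 6, the same default as the xz command, so the level alone would not be expected to explain a 15-fold gap.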
8
  • 1
    I tried the command tar -I 'xz -9' -cf zsh-archive.tar.xz folderpath, but I get the following error: tar: Couldn't open xz -9: No such file or directory – Saaru Lindestøkke Mar 13 at 18:02
  • In macOS? I just checked and it seems to be provided on my system by Homebrew. So I would recommend installing Homebrew and then running: brew install xz – Giacomo1968 Mar 13 at 18:05
  • 1
    Yes, on macOS. man tar shows the -I option is a synonym for the -T option, which is the --files-from option. I've tried it with the longhand option --use-compress-program which resulted in a 10 MB file, instead of the regular 23 MB, but it's still not near the 1.5 MB from Python. – Saaru Lindestøkke Mar 13 at 18:09
  • 1
    Note that I've tried this in the raspbian terminal as well, with similar results to what I get on macOS. – Saaru Lindestøkke Mar 13 at 18:17
  • 2
    @Giacomo1968 I just now realized that the -I option is GNU tar only. The fact that it was missing on my BSD tar should've been an indicator that something was off. – Saaru Lindestøkke Mar 16 at 17:09
4

Makes me wonder what Python is using for compression

http://tukaani.org/xz/

It's probably using the function calls in liblzma. Tar is probably piping through the xz shell command.
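A small check that is consistent with this: an archive written by tarfile in "w:xz" mode starts with the XZ magic bytes and can be decompressed directly with Python's lzma module (which wraps liblzma). This is only a sketch with a made-up one-member archive; it does not show what any particular tar binary does internally.

```python
import io
import lzma
import tarfile

# The six magic bytes at the start of every .xz stream.
XZ_MAGIC = b"\xfd7zXZ\x00"

# Write a one-member .tar.xz to memory with tarfile.
buf = io.BytesIO()
with tarfile.open(fileobj=buf, mode="w:xz") as tar:
    data = b'{"vehicle": 1}'
    info = tarfile.TarInfo("sample.json")
    info.size = len(data)
    tar.addfile(info, io.BytesIO(data))
raw = buf.getvalue()

starts_with_magic = raw.startswith(XZ_MAGIC)

# Decompressing with lzma yields the plain tar stream; its first header
# begins with the member name.
inner = lzma.decompress(raw)
print(starts_with_magic, inner[:11])
```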

A quick comment on --sort=name:

The sort option is a relatively recent enhancement to GNU tar and was introduced in tar version 1.28.

It may never be implemented in BSD tar.

4
  • 1
    This answer is interesting. But it is weird that the focus of this answer is not the question, but rather a comment I made to my own answer just musing about why this happened. – Giacomo1968 Mar 15 at 0:12
  • bsdtar can be provided a sorted list of files to include to work around this issue. As can every other tar program I've ever used (though the syntax might differ). – Michael J. Evans Mar 15 at 1:30
  • Good point on recency. 7z (p7zip) has for many years sorted files (by type) to improve compression; it's nice to see some basic version of this in tar, which is much more compatible with many other things. – Nemo Mar 15 at 7:56
  • @MichaelJ.Evans: and when one uses (even a very old non-GNU) tar ... *something, the shell expands *something into an ordered list of files, which tar then includes sequentially in that (shell-sorted) order. By default it should use the LC_COLLATE order of the shell in which you launch the tar command. – Olivier Dulac Mar 15 at 13:31
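The workaround from these comments (handing tar an explicitly sorted file list) can be sketched from Python as follows. The helper name sorted_tar_command is hypothetical, the demo folder is a temporary stand-in, and the sketch only builds the argument list; it relies on the -c, -J, -f and -C options, which both BSD tar and GNU tar accept. Note that sorted() orders by code point rather than by LC_COLLATE, which makes no difference for purely numeric timestamp names.

```python
import tempfile
from pathlib import Path


def sorted_tar_command(folder, archive):
    """Build a tar command that lists the folder's files explicitly, in name order."""
    folder = Path(folder)
    members = sorted(p.name for p in folder.iterdir() if p.is_file())
    # -C switches to the parent directory so member paths look like "folder/file".
    return ["tar", "-cJf", str(archive), "-C", str(folder.parent),
            *[f"{folder.name}/{name}" for name in members]]


with tempfile.TemporaryDirectory() as tmp:
    folder = Path(tmp, "2021-03-11")
    folder.mkdir()
    # Create the files out of name order on purpose.
    for name in ["1615420920.json", "1615420800.json", "1615420860.json"]:
        (folder / name).write_text("{}")
    cmd = sorted_tar_command(folder, Path(tmp, "sorted-archive.tar.xz"))

print(cmd)
```

The resulting list can be passed to subprocess.run(cmd, check=True) while the folder still exists.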
