“Why Are Tar.xz Files 15× Smaller When Using Python’s Tar Library Compared to MacOS Tar?”, Saaru Lindestøkke, 2021-03-15:

I’m compressing ~1.3 GB folders each filled with 1440 JSON files and find that there’s a 15-fold difference between using the tar command on macOS or Raspbian 10 (Buster) and using Python’s built-in tarfile library…The output is:

…The zsh archive uses an unknown order, and the Python archive orders the files by modification date. I am not sure if that matters…EDIT: Ok, I think I found the issue: BSD tar and GNU tar without any sort options put the files in the archive in an undefined order… I think the reason sorting has such an impact is as follows:

My JSON files contain measurements from hundreds of sensors. Every minute I read out all sensors, but only a few of these sensors have a different value from minute to minute. By sorting the files by name (which has the creation Unix time at the beginning of it), two consecutive files differ in very few characters. Apparently this is very favourable for the compression efficiency.
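
The name-ordered archiving described above could be sketched in Python roughly as follows. This is a minimal illustration, not the author's actual script: the function name `make_sorted_tar_xz` and its parameters are hypothetical, and it assumes (as in the post) that file names begin with a Unix timestamp, so that sorting by name places files from adjacent minutes next to each other in the archive.

```python
import tarfile
from pathlib import Path


def make_sorted_tar_xz(src_dir, archive_path):
    """Write the files under src_dir into an xz-compressed tar in name order.

    Hypothetical helper for illustration; returns the member names in the
    order they were added.
    """
    src = Path(src_dir)
    # Sort by path so that files whose names start with adjacent
    # timestamps end up adjacent in the archive stream.
    files = sorted(p for p in src.rglob("*") if p.is_file())
    names = [p.relative_to(src).as_posix() for p in files]
    with tarfile.open(archive_path, "w:xz") as tar:
        for path, name in zip(files, names):
            # Add each file explicitly; the order of these calls is the
            # order of members inside the archive.
            tar.add(str(path), arcname=name)
    return names
```

Adding the files one by one in an explicit order, rather than handing the whole directory to the archiver, is what makes the member order deterministic here. How much this helps will depend on how similar neighbouring files actually are.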